Translution: Unifying Self-attention and Convolution for Adaptive and Relative Modeling

Fan, Yang, Kankanhalli et al.
When modeling a given type of data, we consider it to involve two key aspects: 1) identifying relevant elements (e.g., image pixels or textual words) to a central element, as in a convolutional receptive field, or to a query element, as in self-attention, and 2) encoding these tokens effectively. Self-attention can adaptively identify these elements but relies on absolute positional embedding for structural representation learning. In contrast, convolution encodes elements in a relative manner, yet their fixed kernel size limits their ability to adaptively select the relevant elements. In this paper, we introduce Translution, an operation that unifies the adaptive identification capability of self-attention and the relative encoding advantage of convolution. However, this integration leads to a substantial increase in the number of parameters, exceeding most currently available computational resources. Therefore, we propose a lightweight variant of Translution, named α-Translution. Experiments on computer vision and natural language processing tasks show that Translution (including α-Translution) achieves superior accuracy compared to self-attention. The code is available at https://github.com/hehefan/Translution.

Basic Information

  • Paper ID: 2510.10060
  • Title: Translution: Unifying Self-attention and Convolution for Adaptive and Relative Modeling
  • Authors: Hehe Fan (Zhejiang University), Yi Yang (Zhejiang University), Mohan Kankanhalli (National University of Singapore), Fei Wu (Zhejiang University)
  • Classification: cs.LG cs.AI cs.CL cs.CV
  • Publication Date: October 11, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2510.10060v1

Abstract

The authors argue that data modeling involves two key aspects: 1) identifying the elements relevant to a central element (as in a convolutional receptive field) or to a query element (as in self-attention); 2) effectively encoding these elements. Self-attention can adaptively identify these elements but relies on absolute positional embeddings for structural representation learning. In contrast, convolution encodes elements in a relative manner, but its fixed kernel size limits its ability to adaptively select relevant elements. This paper proposes Translution, an operation that unifies the adaptive identification capability of self-attention with the relative encoding advantage of convolution. However, this integration substantially increases the number of parameters, exceeding most currently available computational resources. Therefore, the authors propose a lightweight variant, α-Translution. Experiments on computer vision and natural language processing tasks show that Translution (including α-Translution) achieves higher accuracy than self-attention.

Research Background and Motivation

Problem Definition

The core challenge facing current deep learning is how to effectively model data. The authors decompose data modeling into two key aspects:

  1. Relevant Element Identification: Determining which data elements are relevant to the currently processed element
  2. Effective Encoding: Encoding these relevant elements into effective representations

Limitations of Existing Methods

Limitations of Convolutional Neural Networks:

  • Use fixed-size kernels to define local receptive fields
  • Cannot avoid including irrelevant pixels, particularly at object boundaries or background regions
  • While capable of encoding local structure relatively, lack adaptivity

Limitations of Self-attention Mechanisms:

  • Can adaptively identify relevant regions without predefined locality constraints
  • Rely on absolute positional embeddings to capture structural information
  • May struggle to recognize identical objects when they move to different positions

Research Motivation

As direct extensions of Transformer-style models encounter diminishing returns, AI laboratories have observed that each new model generation improves less than expected. With training data approaching saturation and current scaling laws showing their limits, the authors argue that designing new neural network architectures has become critical.

Core Contributions

  1. Proposes Translution Operation: Unifies the adaptive identification capability of self-attention with the relative encoding advantages of convolution
  2. Designs α-Translution Lightweight Variant: Significantly reduces parameter count, making the approach feasible with current computational resources
  3. Theoretical Unification: Proves that convolution and self-attention can be viewed as special cases of Translution
  4. Experimental Validation: Validates the method's effectiveness on computer vision and natural language processing tasks
  5. Open-source Implementation: Provides complete code implementation for community use

Method Details

Task Definition

Given input data (image patches or text tokens), the objective is to learn an operation that can:

  • Adaptively identify elements relevant to the query element
  • Encode structural relationships of these elements in a relative manner
  • Generate effective output representations

Model Architecture

Translution Operation

Translution adopts a convolution-style approach, assigning different parameter matrices for each distance and direction:

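Relative query encoding: $q_{i,j} = f_i \, W^q_{\delta_x,\delta_y}$, with $\delta_x = x_i - x_j$ and $\delta_y = y_i - y_j$
Relative key encoding: $k_{j,i} = f_j \, W^k_{-\delta_x,-\delta_y}$
Relative attention: $a_{i,j} = \frac{q_{i,j} \, k_{j,i}^{\top}}{\sqrt{C'}}$, $\alpha_{i,j} = \frac{e^{a_{i,j}}}{\sum_{n} e^{a_{i,n}}}$
Relative value encoding: $v_{i,j} = f_j \, W^v_{\delta_x,\delta_y}$
Weighted summation: $f'_i = \sum_{j} \alpha_{i,j} \, v_{i,j}$

where $W^q_{\delta_x,\delta_y}, W^k_{\delta_x,\delta_y}, W^v_{\delta_x,\delta_y} \in \mathbb{R}^{C \times C'}$ are learnable parameter matrices corresponding to the displacement $(\delta_x, \delta_y)$.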
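
The five steps above can be sketched in plain Python. This is a minimal, unoptimized illustration under stated assumptions, not the authors' implementation: the helper names (`matvec`, `init_offset_weights`, `translution`), the dictionary keyed by displacement, and the toy sizes are all hypothetical choices made for readability.

```python
import math
import random


def matvec(f, mat):
    """Row vector f (length C) times mat (C x C') -> list of length C'."""
    return [sum(f[c] * mat[c][d] for c in range(len(f))) for d in range(len(mat[0]))]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def init_offset_weights(height, width, c_in, c_out, rng):
    """One c_in x c_out matrix per displacement: (2H-1)*(2W-1) matrices in total."""
    return {
        (dx, dy): [[rng.gauss(0.0, 0.5) for _ in range(c_out)] for _ in range(c_in)]
        for dx in range(-(width - 1), width)
        for dy in range(-(height - 1), height)
    }


def translution(feats, coords, wq, wk, wv):
    """feats[i] is the length-C feature of element i, coords[i] its (x, y) position."""
    c_out = len(next(iter(wq.values()))[0])
    out = []
    for f_i, (x_i, y_i) in zip(feats, coords):
        scores, values = [], []
        for f_j, (x_j, y_j) in zip(feats, coords):
            dx, dy = x_i - x_j, y_i - y_j
            q_ij = matvec(f_i, wq[(dx, dy)])      # relative query encoding
            k_ji = matvec(f_j, wk[(-dx, -dy)])    # relative key encoding
            scores.append(sum(a * b for a, b in zip(q_ij, k_ji)) / math.sqrt(c_out))
            values.append(matvec(f_j, wv[(dx, dy)]))  # relative value encoding
        alpha = softmax(scores)                   # relative attention
        out.append([sum(a * v[d] for a, v in zip(alpha, values)) for d in range(c_out)])
    return out


# Toy usage: three elements on a 4 x 4 grid, then the same elements shifted.
rng = random.Random(0)
H, W, C, C_OUT = 4, 4, 3, 2
wq, wk, wv = (init_offset_weights(H, W, C, C_OUT, rng) for _ in range(3))
feats = [[rng.gauss(0.0, 1.0) for _ in range(C)] for _ in range(3)]
coords = [(0, 0), (1, 0), (0, 1)]
shifted = [(x + 2, y + 1) for x, y in coords]
out = translution(feats, coords, wq, wk, wv)
out_shifted = translution(feats, shifted, wq, wk, wv)
```

Because every weight is looked up by displacement only, moving all three elements by the same amount leaves the outputs unchanged in this sketch, which is the property that absolute positional embeddings do not provide.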

α-Translution Lightweight Variant

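Since full Translution requires (2H-1)×(2W-1)×C×C' parameters for each family of offset-indexed matrices, α-Translution reduces the count by factorizing each matrix through lower input and output dimensions:

$W^q_{\delta_x,\delta_y} \Rightarrow W^q_1 \cdot W^q_{\delta_x,\delta_y}$
$W^k_{\delta_x,\delta_y} \Rightarrow W^k_1 \cdot W^k_{\delta_x,\delta_y}$
$W^v_{\delta_x,\delta_y} \Rightarrow W^v_1 \cdot W^v_{\delta_x,\delta_y} \cdot W^v_2$

where $C_1 \ll C$ and $C_2 \ll C'$ are the reduced input and output dimensions.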
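
To get a feel for the savings, the parameter counts can be compared with a few lines of arithmetic. The matrix shapes used for the factorized variant are an assumption read off the factorization above (shared matrices of size C×C1 and C2×C', offset-indexed matrices of size C1×C' for queries and keys and C1×C2 for values); the summary does not state them explicitly, and the function names and example sizes are hypothetical.

```python
def translution_params(height, width, c_in, c_out):
    """Parameters of one family of offset-indexed matrices (e.g. all W^q)."""
    offsets = (2 * height - 1) * (2 * width - 1)
    return offsets * c_in * c_out


def alpha_translution_params(height, width, c_in, c_out, c1, c2):
    """Total q + k + v parameters under the ASSUMED shapes described in the lead-in."""
    offsets = (2 * height - 1) * (2 * width - 1)
    q = c_in * c1 + offsets * c1 * c_out                 # W^q_1, then C1 x C' per offset
    k = c_in * c1 + offsets * c1 * c_out                 # W^k_1, then C1 x C' per offset
    v = c_in * c1 + offsets * c1 * c2 + c2 * c_out       # W^v_1, C1 x C2 per offset, W^v_2
    return q + k + v


# Illustrative sizes only: a 14 x 14 grid, C = 192, C' = 64, C1 = C2 = 8.
full = 3 * translution_params(14, 14, 192, 64)
light = alpha_translution_params(14, 14, 192, 64, 8, 8)
print(f"full: {full:,}  factorized: {light:,}  ratio: {full / light:.1f}x")
```

Under these assumptions the offset-indexed matrices dominate both counts, so shrinking their inner dimensions is what drives the reduction.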

Technical Innovations

1. Theoretical Unification

The authors prove that convolution and self-attention are special cases of Translution:

  • Convolution: Attention weights are 1 within the receptive field and 0 outside
  • Self-attention: Uses shared W^q, W^k, W^v parameters, ignoring directional and distance encoding
  • Translution: Combines the advantages of both
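
The two special cases can be written out from the definitions in the Translution Operation section. The following is a sketch of the reduction rather than the paper's formal proof; the symbol $\mathcal{K}$ for the set of offsets inside the kernel window is notation introduced here for illustration.

```latex
% Self-attention: tie the matrices across all offsets.
\begin{align*}
W^q_{\delta_x,\delta_y} = W^q,\quad
W^k_{\delta_x,\delta_y} = W^k,\quad
W^v_{\delta_x,\delta_y} = W^v
\;&\Longrightarrow\;
q_{i,j} = f_i W^q = q_i,\quad
k_{j,i} = f_j W^k = k_j,\quad
v_{i,j} = f_j W^v = v_j, \\
f'_i &= \sum_{j} \frac{e^{q_i k_j^{\top}/\sqrt{C'}}}{\sum_{n} e^{q_i k_n^{\top}/\sqrt{C'}}}\, v_j .
\end{align*}

% Convolution: replace the learned attention weights by a window indicator.
\begin{align*}
\alpha_{i,j} = \mathbb{1}\!\left[(\delta_x,\delta_y) \in \mathcal{K}\right]
\;&\Longrightarrow\;
f'_i = \sum_{j \,:\, (\delta_x,\delta_y) \in \mathcal{K}} f_j \, W^v_{\delta_x,\delta_y} .
\end{align*}
```

In the first case the offset index drops out and only content-based attention remains; in the second the attention is fixed and only the offset-indexed value matrices remain, which act as the kernel weights.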

2. Relative Position Encoding

Unlike existing methods (scalar bias or vector addition), Translution uses offset-based matrices for relative encoding, better capturing directional and distance information.

3. Memory-Optimized Implementation

A memory-efficient implementation for α-Translution is designed, reducing peak memory usage from N×N×C' to N×C'+N×N×C2.
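
One way to arrive at this accounting, offered here as an assumption rather than a description of the released code, is to keep the per-pair values in the reduced dimension C2 and apply the shared output projection only after the attention-weighted sum. Linearity makes the two orders equivalent. The names and sizes below are hypothetical.

```python
def peak_elements_naive(n, c_out):
    """Materialize every v_{i,j} in the output dimension: N x N x C'."""
    return n * n * c_out


def peak_elements_optimized(n, c_out, c2):
    """Keep per-pair values in C2 dims (N x N x C2) plus the N x C' output."""
    return n * c_out + n * n * c2


def project(u, w2):
    """Row vector u (length C2) times w2 (C2 x C')."""
    return [sum(u[a] * w2[a][d] for a in range(len(u))) for d in range(len(w2[0]))]


# Toy check for one query: sum_j alpha_j (u_j W2) == (sum_j alpha_j u_j) W2
alpha = [0.2, 0.5, 0.3]
u = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]      # low-dimensional values, C2 = 2
w2 = [[1.0, 0.0, 2.0], [-1.0, 1.0, 0.5]]       # shared output projection, C' = 3

project_then_sum = [sum(a * project(uj, w2)[d] for a, uj in zip(alpha, u)) for d in range(3)]
mixed = [sum(a * uj[c] for a, uj in zip(alpha, u)) for c in range(2)]
sum_then_project = project(mixed, w2)

# Illustrative sizes only: N = 196 tokens, C' = 64, C2 = 8.
n, c_out, c2 = 196, 64, 8
naive = peak_elements_naive(n, c_out)
optimized = peak_elements_optimized(n, c_out, c2)
print(f"naive: {naive:,} elements, optimized: {optimized:,} elements")
```

The N×N×C2 term still grows quadratically with the number of tokens, so the saving comes from C2 being much smaller than C', not from removing the pairwise tensor.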

Experimental Setup

Datasets

Computer Vision Tasks:

  • Dynamic MNIST: Synthetic dataset with digits moving within an 84×84 pixel region
  • Static MNIST: Control dataset with digits fixed at image center
  • ImageNet-1K: Large-scale image classification dataset with 1000 classes
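
The difference between the Dynamic and Static MNIST settings can be illustrated with a small placement routine. This is a guess at the construction, not the authors' data pipeline: it assumes standard 28×28 MNIST digits and uniformly random placement for the dynamic case, and `place_digit` is a hypothetical helper.

```python
import random


def place_digit(digit, canvas_size=84, center=False, rng=random):
    """Paste a square digit (list of rows) onto a blank canvas_size x canvas_size canvas."""
    size = len(digit)
    if center:
        top = left = (canvas_size - size) // 2           # static setting
    else:
        top = rng.randrange(canvas_size - size + 1)      # dynamic setting
        left = rng.randrange(canvas_size - size + 1)
    canvas = [[0] * canvas_size for _ in range(canvas_size)]
    for r, row in enumerate(digit):
        canvas[top + r][left:left + size] = row
    return canvas, (top, left)


rng = random.Random(0)
digit = [[1] * 28 for _ in range(28)]    # stand-in for a real MNIST image
dynamic_canvas, dynamic_pos = place_digit(digit, rng=rng)
static_canvas, static_pos = place_digit(digit, center=True)
```

Training on centered digits and testing on randomly placed ones is then the Static→Dynamic condition reported in the results.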

Natural Language Processing Tasks:

  • OpenWebText: 9 billion training tokens, 4 million validation tokens, 50K vocabulary

Evaluation Metrics

  • Image Classification: Top-1 and Top-5 accuracy
  • Language Modeling: Perplexity
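
For reference, both metrics are simple to compute. The sketch below uses the standard definitions (top-k accuracy as the fraction of samples whose true label is among the k highest-scoring classes, perplexity as the exponential of the mean negative log-likelihood per token); the function names and toy inputs are illustrative and not taken from the paper's code.

```python
import math


def top_k_accuracy(scores, labels, k):
    """scores[i][c] is the score of class c for sample i; labels[i] is the true class."""
    hits = 0
    for row, label in zip(scores, labels):
        top = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in top
    return hits / len(labels)


def perplexity(token_log_probs):
    """exp of the mean negative natural-log probability assigned to each true token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
labels = [1, 1, 0]
top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
uniform_ppl = perplexity([math.log(0.25)] * 5)   # model that is uniform over 4 tokens
```

Higher is better for accuracy and lower is better for perplexity, which is why the language modeling table reads in the opposite direction from the image classification tables.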

Comparison Methods

  • Standard self-attention (Transformer baseline)
  • Relative position encoding variants (Shaw et al., Swin Transformer, ConViT, RoFormer, etc.)
  • Absolute encoding variants (for ablation studies)

Implementation Details

  • Architecture configuration: depth of 6-12 layers, embedding dimension of 192-384, 3-6 attention heads
  • α-Translution default compression dimensions: C1 = C2 = 8
  • Batch size: 256 (ImageNet), 8 (OpenWebText)
  • All training from scratch without external pretraining

Experimental Results

Main Results

Dynamic MNIST Experiments

| Method | Parameters | Static→Static | Dynamic→Dynamic | Static→Dynamic |
|---|---|---|---|---|
| Self-attention | 2.7M | 98.48% | 92.64% | 18.18% |
| α-Translution | 4.6M | 98.48% | 97.31% | 34.90% |
| Translution | 116.2M | 98.60% | 97.35% | 36.40% |

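Key Finding: Translution performs markedly better when positions vary. In the Static→Dynamic setting, accuracy rises from 18.18% for self-attention to 34.90% for α-Translution and 36.40% for Translution, consistent with the advantages of relative encoding.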

ImageNet-1K Experiments

Using ViT-A/56 as an example:

| Method | Parameters | Top-1 | Top-5 |
|---|---|---|---|
| Self-attention | 4.7M | 46.28% | 71.17% |
| α-Translution | 5.3M | 48.36% | 73.31% |
| Translution | 38.5M | 52.41% | 76.50% |

Natural Language Modeling Experiments

| Method | Parameters | Perplexity |
|---|---|---|
| Self-attention | 22.0M | 60.40 |
| α-Translution | 23.7M | 57.97 |
| Translution | 127.5M | 56.26 |

Ablation Studies

1. Impact of Parameter Increase vs. Relative Encoding

Experiments show that simply adding parameters while keeping absolute encoding does not improve performance, which indicates that the gains come from relative encoding itself rather than from model size.

2. Impact of Relative Encoding Dimensions

As C1 and C2 increase, α-Translution performance improves, but parameter count also increases, presenting an efficiency-effectiveness trade-off.

3. Comparison of Position Encoding Methods

| Method | Parameters | Top-1 | Top-5 |
|---|---|---|---|
| No positional embedding | 4.69M | 42.49% | 67.39% |
| Standard positional embedding | 4.69M | 46.28% | 71.17% |
| Swin Transformer | 4.69M | 46.36% | 71.31% |
| RoFormer | 4.69M | 46.65% | 71.51% |
| α-Translution | 5.33M | 48.36% | 73.31% |
| Translution | 38.53M | 52.41% | 76.50% |

Experimental Findings

  1. Importance of Relative Encoding: In position-variation scenarios, relative encoding significantly outperforms absolute encoding
  2. Parameter Efficiency: α-Translution achieves significant performance improvements with modest parameter increases
  3. Cross-modal Effectiveness: The method is effective on both vision and language tasks
  4. Memory Constraints: Current GPU memory limits restrict large-scale experiments; larger-scale evaluations would require 2-3 TB of memory
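
The parameter-efficiency finding can be made concrete with the ImageNet-1K numbers reported for ViT-A/56. The ratio below (Top-1 points gained per additional million parameters) is a crude illustrative measure chosen here, not a metric used in the paper, and the function name is hypothetical.

```python
# (parameters in millions, Top-1 accuracy in %) as reported for ViT-A/56 on ImageNet-1K
results = {
    "Self-attention": (4.7, 46.28),
    "alpha-Translution": (5.3, 48.36),
    "Translution": (38.5, 52.41),
}


def gain_per_million(name, baseline="Self-attention"):
    """Top-1 points gained over the baseline per extra million parameters."""
    base_params, base_acc = results[baseline]
    params, acc = results[name]
    return (acc - base_acc) / (params - base_params)


alpha_rate = gain_per_million("alpha-Translution")
full_rate = gain_per_million("Translution")
print(f"alpha-Translution: {alpha_rate:.2f} points/M, Translution: {full_rate:.2f} points/M")
```

By this measure the lightweight variant buys far more accuracy per added parameter, while full Translution reaches the highest absolute accuracy.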

Related Work

Position Encoding Research

The authors categorize related work into three types:

  1. Relative Position Vectors: Shaw et al., BoTNet, HaloNet, etc.
  2. Relative Position Scalars: Swin Transformer, CoAtNet, ConViT, etc.
  3. Rotary Position Embeddings: RoFormer, etc.

Combining Convolution and Attention

  • Architecture-level Combination: Conformer, CeiT, etc. use convolution and attention in different layers
  • Module-level Combination: Translution unifies both at the fundamental operation level

Conclusions and Discussion

Main Conclusions

  1. Translution successfully unifies the adaptive identification capability of self-attention with the relative encoding advantages of convolution
  2. α-Translution provides a good balance between parameter efficiency and performance
  3. Relative encoding significantly outperforms absolute encoding when handling position variations
  4. The method demonstrates improvements across multiple tasks and modalities

Limitations

  1. Computational Resource Requirements: Full Translution requires substantial parameters and memory
  2. Evaluation Scale Limitations: Due to resource constraints, evaluation is primarily on small-to-medium scale architectures
  3. Unexplored Parameter Sharing: In some scenarios, certain relative positions, particularly distant ones, may be able to share parameters, which the current design does not exploit

Future Directions

  1. Optimized Variant Exploration: Design more efficient Translution variants
  2. Multi-modal Extension: Extend to other modalities such as 3D, video, and molecules
  3. Architecture Design: Design more effective specialized architectures for Translution
  4. Large-scale Evaluation: Validate on larger-scale frameworks and datasets

In-depth Evaluation

Strengths

  1. Theoretical Contribution: Provides a unified perspective on convolution and self-attention, theoretically elegant
  2. Practical Value: α-Translution delivers performance improvements even under resource constraints
  3. Comprehensive Experiments: Covers multiple tasks, datasets, and ablation studies
  4. Clear Problem Identification: Clearly identifies and addresses core limitations of existing methods
  5. Open-source Contribution: Provides complete implementation, promoting community research

Weaknesses

  1. Resource Requirements: Computational demands of the full method may limit practical applications
  2. Evaluation Scale: Lacks evaluation on large-scale models due to resource constraints
  3. Theoretical Analysis: Lacks in-depth theoretical analysis of convergence and optimization properties
  4. Comparison Fairness: Significant parameter differences with baselines may affect comparison fairness

Impact

  1. Academic Value: Provides new insights into combining attention mechanisms and convolution
  2. Practical Prospects: The practicality of α-Translution makes it likely to be adopted in real applications
  3. Inspirational Significance: May inspire further research on fundamental operation unification

Applicable Scenarios

  1. Position-sensitive Tasks: Particularly suitable for tasks requiring position-variation handling
  2. Structured Data: Effective on data with spatial or sequential structure such as images and text
  3. Resource-rich Environments: Full Translution suitable for scenarios with abundant computational resources
  4. Research Exploration: Provides new directions for fundamental architecture research

References

The paper cites important works in the deep learning field, including:

  • Original Transformer paper (Vaswani et al., 2017)
  • Vision Transformer (Dosovitskiy et al., 2021)
  • Related work on relative position encoding (Shaw et al., 2018; Liu et al., 2021, etc.)
  • Classical convolutional neural network works (LeCun et al., 1998; He et al., 2016, etc.)

Overall Assessment: This is a high-quality paper with contributions in both theory and practice. Although full Translution is computationally demanding, the proposed α-Translution variant balances performance and efficiency well. The paper offers a novel perspective on unifying fundamental operations in deep learning and has both academic and practical value.