Translation Entropy: A Statistical Framework for Evaluating Translation Systems

Gross, Harel, Kanter

The translation of written language has been known since the 3rd century BC; however, its necessity has become increasingly common in the information age. Today, many translators exist, based on encoder-decoder deep architectures; nevertheless, no quantitative objective methods are available to assess their performance, likely because the entropy of even a single language remains unknown. This study presents a quantitative method for estimating translation entropy, with the following key finding. Given a translator, several sentences that differ by only one selected token of a given pivot sentence yield identical translations. Analyzing the statistics of this phenomenon across an ensemble of such sentences, each containing the selected pivot token, yields the probabilities of replacing this specific token with others while preserving the translation. These probabilities constitute the entropy of the selected token, and the average across all selected pivot tokens provides an estimate of the translator's overall translation entropy, which is enhanced along the decoder blocks. This entropic measure allows for the quantitative ranking of several publicly available translators and reveals whether mutual translation entropy is symmetric. Extending the proposed method to include the replacement of two tokens in a given pivot sentence demonstrates a multiplicative effect, where translation degeneracy is proportional to the product of the degeneracies of the two tokens. These findings establish translation entropy as a measurable property and as an objective benchmark for artificial translators. Results are based on MarianMT, T5-Base and NLLB-200 translators.

Basic Information

Paper ID: 2511.13180
Title: Translation Entropy: A Statistical Framework for Evaluating Translation Systems
Authors: Ronit D. Gross, Yanir Harel, Ido Kanter (Bar-Ilan University)
Classification: cs.CL (Computational Linguistics)
Publication Date: 2025
Paper Link: https://arxiv.org/abs/2511.13180

Abstract

This research addresses the lack of objective quantitative evaluation methods for machine translation systems by proposing a statistical framework for estimating Translation Entropy (TE). The core finding is that given a translator, multiple source sentences differing in only one selected token may produce identical translations. By analyzing the statistical properties of this phenomenon, one can compute the probability distribution of replacing specific tokens while maintaining translation invariance, thereby obtaining the entropy value for that token. Averaging entropy values across all selected tokens yields an estimate of the translator's overall translation entropy. The method enables ranking of multiple public translators, reveals symmetry properties of mutual translation entropy, and discovers multiplicative effects in dual-token replacement. The research is validated using three translation models: MarianMT, T5-Base, and NLLB-200.

Research Background and Motivation

1. Core Problem to Address

Machine translation systems (particularly deep learning-based encoder-decoder architectures) lack objective quantitative evaluation methods. Although metrics such as BLEU and COMET exist, they primarily rely on lexical and semantic similarity to reference translations, making it difficult to measure the intrinsic properties of translators from an information-theoretic perspective.

2. Problem Significance

Theoretical Level: The entropy value of a single language cannot be precisely calculated to date. Shannon estimated English entropy at approximately 1 bit per letter in 1951, but extending this to longer text sequences is computationally infeasible.
Practical Level: With increasing translation demands in the information age, objective methods are needed to evaluate and compare the performance of different translation systems.
Scientific Significance: Understanding information degradation phenomena in translation processes and revealing intrinsic relationships between languages.

3. Limitations of Existing Methods

BLEU: Based on n-gram matching, unable to recognize translations with different wording but identical meaning.
COMET: Although using neural models to understand semantics, it still depends on reference translations and shows small score variance (see Table 8).
Theoretical Dilemma: Theoretical estimation of language entropy remains unsolved; translation entropy is even more complex.

4. Research Motivation

Propose a method to estimate translation entropy without needing to know individual language entropies, quantifying the "translation degeneracy" phenomenon from an information-theoretic perspective.

Core Contributions

Proposes a computable definition of Translation Entropy (TE): Quantifies translation entropy through probability distributions of token replacements that maintain translation invariance.
Develops a systematic TE estimation method: Includes complete workflow of pivot sentence selection, token replacement, subgroup statistics, and entropy calculation.
Discovers multiplicative effects of translation degeneracy: Dual-token replacement degeneracy is approximately 0.5-0.9 times the product of single-token degeneracies.
Reveals asymmetry in mutual translation entropy: English-French translation shows significant asymmetry (French→English S₉₅ is approximately 2.6 times English→French), while English-Hebrew translation is approximately symmetric.
Quantitatively ranks three mainstream translators: MarianMT, T5-Base, and NLLB-200, discovering non-monotonic relationships between model size and performance.
Verifies entropy-decreasing patterns across decoder blocks: Translation quality progressively improves along decoder layers (entropy decreases from 10,712 to 116).

Methodology Details

Task Definition

Input: Encoder-decoder translation model, source language dataset
Output: Translation entropy value S (or S₉₅) quantifying the translator's translation degeneracy
Constraints: Requires sufficient source sentences containing selected tokens (this study uses 30 pivot sentences)

Model Architecture

Overall Workflow

Translation entropy estimation consists of the following steps:

Step 1: Single-Token Analysis

Select a pivot token T₁
Select 30 source sentences containing T₁ from the training dataset (at position j)
For each sentence, replace T₁ at position j with all possible tokens (~30,000)
Identify which replacements produce translations identical to the original pivot sentence
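
The single-token scan in Step 1 can be sketched as follows. This is a minimal illustration, not the authors' code: `invariant_replacements`, `toy_translate`, and `TOY_LEXICON` are hypothetical names, the translator is a toy word-for-word lookup standing in for a real model, and excluding the pivot token from its own subgroup is an assumption.

```python
from typing import Callable, Iterable, List, Set


def invariant_replacements(
    pivot_tokens: List[str],
    position: int,
    vocabulary: Iterable[str],
    translate: Callable[[List[str]], str],
) -> Set[str]:
    """Tokens that can replace pivot_tokens[position] without changing the translation."""
    reference = translate(pivot_tokens)
    kept: Set[str] = set()
    for candidate in vocabulary:
        if candidate == pivot_tokens[position]:
            continue  # assumption: the pivot token itself is not counted
        variant = pivot_tokens[:position] + [candidate] + pivot_tokens[position + 1:]
        if translate(variant) == reference:
            kept.add(candidate)
    return kept


# Hypothetical toy translator: a word-for-word lookup, NOT a real model.
TOY_LEXICON = {"i": "je", "buy": "acheter", "purchase": "acheter",
               "sell": "vendre", "bread": "pain"}


def toy_translate(tokens: List[str]) -> str:
    return " ".join(TOY_LEXICON.get(token, token) for token in tokens)


subgroup = invariant_replacements(["i", "buy", "bread"], 1, TOY_LEXICON, toy_translate)
print(subgroup)  # {'purchase'}
```

With a real translator, `translate` would wrap the model's generate-and-decode call, and `vocabulary` would be the roughly 30,000 tokens of the source tokenizer.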

Step 2: Subgroup Construction

For each pivot sentence m, construct subgroup SG_m(T₁) containing all replacement tokens that preserve translation invariance
To avoid anomalously large subgroups (e.g., when models ignore certain tokens, nearly all tokens become replaceable), retain only the 24 smallest subgroups, denoted SG₂₄(T₁)
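
A minimal sketch of the subgroup pruning in Step 2, assuming each subgroup is represented as a Python set of replacement tokens. The function name `keep_smallest` and the toy data are illustrative only.

```python
from typing import List, Set


def keep_smallest(subgroups: List[Set[str]], keep: int = 24) -> List[Set[str]]:
    """Return the `keep` smallest subgroups, dropping anomalously large ones."""
    return sorted(subgroups, key=len)[:keep]


# Toy data: 30 subgroups with sizes 1..30 (one per pivot sentence).
toy_subgroups = [{f"tok{i}" for i in range(size)} for size in range(1, 31)]
sg24 = keep_smallest(toy_subgroups)
print(len(sg24), max(len(sg) for sg in sg24))  # 24 24
```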

Step 3: Probability Calculation

Count the occurrences of each token i across the 24 subgroups in SG₂₄(T₁) (between 1 and 24), then divide by 24 to obtain the probability P_i:

P_i = (number of occurrences of token i in 24 subgroups) / 24
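
The counting rule for P_i can be sketched with `collections.Counter`. The function name `replacement_probabilities` and the toy subgroups are hypothetical; each subgroup is assumed to be a set, so a token is counted at most once per subgroup.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def replacement_probabilities(subgroups: Iterable[Set[str]]) -> Dict[str, float]:
    """P_i = (number of subgroups containing token i) / (number of subgroups)."""
    subgroups = list(subgroups)
    counts = Counter(token for subgroup in subgroups for token in subgroup)
    return {token: n / len(subgroups) for token, n in counts.items()}


# Toy data: "purchase" appears in all 24 subgroups, "get" in 12, "jug" in 1.
toy_sg24 = [
    {"purchase"} | ({"get"} if m % 2 == 0 else set()) | ({"jug"} if m == 0 else set())
    for m in range(24)
]
probs = replacement_probabilities(toy_sg24)
print(probs["purchase"], probs["get"])  # 1.0 0.5
```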

Step 4: Entropy Calculation

Single-token entropy: $S(T_1) = -\sum_i P_i \log_2 P_i \quad \text{(Eq. 2)}$

Average number of replacements: $N_{Av}(T_1) = 24 \sum_i P_i \quad \text{(Eq. 1)}$

Step 5: Threshold Filtering

To exclude meaningless low-probability replacements (gibberish tokens), apply a threshold: $P_i > \text{Threshold} = \frac{\beta_c}{24} \quad \text{(Eq. 4)}$

The study uses β_c = 5 (i.e., P_i > 5/24 ≈ 0.208), so a token must appear in at least 6 of the 24 subgroups to be counted.
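
Eq. 2 and Eq. 4 combine into a few lines of code. The sketch below assumes the threshold is applied before summing the entropy terms; `token_entropy` and the toy probabilities are illustrative names and values, not data from the paper.

```python
import math
from typing import Dict


def token_entropy(probs: Dict[str, float], beta_c: float = 5, n_subgroups: int = 24) -> float:
    """S(T) = -sum_i P_i log2 P_i over tokens with P_i > beta_c / n_subgroups."""
    threshold = beta_c / n_subgroups
    return -sum(p * math.log2(p) for p in probs.values() if p > threshold)


toy_probs = {"purchase": 1.0, "get": 0.5, "obtain": 0.25, "jug": 1 / 24}
print(token_entropy(toy_probs))            # 1.0 ("jug" is filtered out)
print(token_entropy(toy_probs, beta_c=0))  # larger: the noise token now contributes
```

Note that the P_i are per-token frequencies and need not sum to 1, so this is an entropy-like sum rather than the entropy of a normalized distribution.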

Step 6: Overall Entropy Estimation

Repeat the above process for 100 randomly selected pivot tokens and calculate the average entropy: $S = \langle S(T_\alpha) \rangle_\alpha \quad \text{(Eq. 5)}$

To reduce the impact of outliers, use S₉₅ (average of only the 95 lowest entropy values)
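
A short sketch of how S and S₉₅ differ, using synthetic per-token entropies (not the paper's data). The helper name `mean_of_lowest` is hypothetical.

```python
from typing import List, Optional


def mean_of_lowest(entropies: List[float], keep: Optional[int] = None) -> float:
    """Average all values, or only the `keep` lowest ones when `keep` is given."""
    values = sorted(entropies)
    if keep is not None:
        values = values[:keep]
    return sum(values) / len(values)


# Synthetic example: 95 typical tokens and 5 outliers.
toy_entropies = [2.0] * 95 + [200.0] * 5
s_all = mean_of_lowest(toy_entropies)          # S, Eq. 5
s_95 = mean_of_lowest(toy_entropies, keep=95)  # S95
print(s_all, s_95)  # 11.9 2.0
```

A handful of outliers dominates S while leaving S₉₅ untouched, which is the same pattern seen in the gap between the S and S₉₅ columns of the results tables.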

Technical Innovations

1. Conditional Degeneracy Measurement

Rather than asking which tokens can be swapped within one specific sentence, this method asks which tokens consistently preserve the translation across multiple sentences containing the pivot token, which is a stronger conditional constraint.

2. Rationality of Threshold Design

By analyzing the distribution characteristics of P_i:

P_i = 1: Strong synonyms, entropy contribution is 0
P_i ≈ 0.37 (1/e): Maximum entropy contribution
P_i ≪ 0.37: Noise tokens, require filtering

Threshold β_c = 5 corresponding to P_i ≈ 0.208 balances retaining meaningful replacements and filtering noise.
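
The 1/e figure follows from maximizing the per-token contribution to Eq. 2, $s(P) = -P \log_2 P$ (the symbol $s$ is introduced here only for illustration):

$\frac{ds}{dP} = -\frac{\ln P + 1}{\ln 2} = 0 \;\Rightarrow\; P = \frac{1}{e} \approx 0.368, \qquad s\left(\tfrac{1}{e}\right) = \frac{1}{e \ln 2} \approx 0.531 \text{ bits}$

Since $s(1) = 0$ and $s(P) \to 0$ as $P \to 0$, this stationary point is the maximum. At the threshold itself, $s(5/24) \approx 0.471$ bits, so a token just above the cut still contributes close to the maximum.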

3. Dual-Token Multiplicative Effect

Translation degeneracy satisfies an approximate multiplicative relationship: $SG(T_\alpha, T_\beta) > 0.5 \cdot SG(T_\alpha) \cdot SG(T_\beta) \quad \text{(Eq. 6)}$

Coefficients of 0.5-0.9 indicate semantic correlations between tokens, suggesting translation does not process each token completely independently.

4. Distinction from Baselines

vs BLEU: Does not depend on reference translations; measures model's intrinsic information degradation.
vs COMET: Quantifies from information-theoretic perspective rather than semantic similarity.
vs Language Entropy Estimation: Circumvents computational difficulties of single-language entropy, directly measuring entropy of translation mapping.

Experimental Setup

Datasets

MarianMT Training Data: Opus100 dataset containing approximately 1 million training sentences and 2,000 validation sentences
Language Pairs: English-French (approximately 30,000 tokens each), English-Hebrew
Pivot Sentence Selection:
- 30 source sentences containing each pivot token
- Token frequency range: 500-1,500 occurrences (excluding overly frequent conjunctions and rare words)
- Sentence length: Maximum 128 tokens

Evaluation Metrics

S: Average entropy of 100 pivot tokens
S₉₅: Average of 95 lowest entropy values (primary metric, excluding outliers)
N_Av: Average number of replacements
|SG|: Subgroup size

Comparison Methods

Translation Models:
- MarianMT (Helsinki-NLP/opus-mt): 6 encoder + 6 decoder blocks, ~75M parameters
- T5-Base (Google): 12 encoder + 12 decoder blocks, ~223M parameters
- NLLB-200 (Facebook): 12 encoder + 12 decoder blocks, ~615M parameters
Traditional Metrics: BLEU and COMET scores

Implementation Details

Number of Pivot Tokens: 100 randomly selected
Sentences per Token: 30
Number of Subgroups: Retain 24 smallest subgroups
Threshold: β_c = 5 (main results), β_c = 9 (robustness verification)
Decoder Block Analysis: Freeze first m blocks, train fully connected layer (50 epochs, CosineAnnealingLR, learning rate 1e-4)

Experimental Results

Main Results

1. English-French Mutual Translation Asymmetry (MarianMT)

Direction	S	S₉₅
EN→FR	29.5	3.6
FR→EN	20.7	9.5

Finding: FR→EN S₉₅ is 2.6 times EN→FR, showing significant asymmetry

2. English-Hebrew Mutual Translation Symmetry (MarianMT)

Direction	S	S₉₅
EN→HE	8.0	5.7
HE→EN	17.5	6.3

Finding: S₉₅ values are close (5.7 vs 6.3), showing approximate symmetry

3. Ranking of Three Translators (EN→FR)

Model	S	S₉₅	Parameters
MarianMT	29.5	3.6	~75M
NLLB-200	73.5	13.0	~615M
T5-Base	90.9	2.8	~223M

Finding: T5-Base performs best on S₉₅, followed by MarianMT; NLLB-200 with the most parameters performs worst

4. Ranking of Three Translators (FR→EN)

Model	S	S₉₅
MarianMT	20.7	9.5
NLLB-200	251.2	108.9
T5-Base	394.0	295.9

Finding: MarianMT significantly outperforms the other two models

5. Comparison with Traditional Metrics

Model	EN→FR BLEU	EN→FR COMET	FR→EN BLEU	FR→EN COMET
MarianMT	38.83	0.8026	39.82	0.8223
NLLB-200	33.27	0.798	34.38	0.8037
T5-Base	37.08	0.7763	28.19	0.7299

Observations:

MarianMT leads across all BLEU and COMET metrics
TE ranking partially aligns with COMET/BLEU (FR→EN) but differs for EN→FR
COMET scores fall in a narrow range (0.73-0.82), providing less discrimination than TE

Ablation Studies

1. Threshold Robustness Verification

S₉₅ values using β_c = 9:

EN→FR: MarianMT (1.5), NLLB-200 (2.8), T5-Base (1.1)
FR→EN: MarianMT (2.8), NLLB-200 (6.5), T5-Base (3.9)

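Conclusion: The EN→FR ranking is unchanged (T5-Base lowest, then MarianMT, then NLLB-200), and MarianMT remains lowest for FR→EN; however, with the values listed here, NLLB-200 and T5-Base swap places for FR→EN relative to β_c = 5, so robustness to threshold selection is only partial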

2. Translation Noise Analysis Without Threshold (β_c = 0)

Direction	MarianMT	NLLB-200	T5-Base
EN→FR S₉₅	116.1	1,374.3	258.6
FR→EN S₉₅	379.9	2,840.6	1,176.9

Finding:

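Entropy values increase substantially relative to β_c = 5 (roughly 30-100 fold for EN→FR and 4-40 fold for FR→EN)
Rankings only partly align with the threshold-based cases: MarianMT is lowest and NLLB-200 highest in both directions, whereas with β_c = 5 T5-Base is lowest for EN→FR and highest for FR→EN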
Validates existence of translation noise and necessity of threshold filtering

3. Entropy Reduction Across Decoder Blocks

Decoder Block	1	2	3	4	5	6
S₉₅	10,712	6,114	3,295	908	147	116

Conclusion: Translation quality progressively improves across decoder layers, with S₉₅ falling steeply and monotonically (about 92-fold overall, from 10,712 to 116)
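
The shape of the decrease can be checked directly from the table values. The sketch below only does arithmetic on the six S₉₅ numbers reported above; the variable names are illustrative.

```python
# S95 after decoder blocks 1..6, copied from the table above.
s95_by_block = [10_712, 6_114, 3_295, 908, 147, 116]

# Ratio of each block's S95 to the previous block's S95.
step_ratios = [after / before for before, after in zip(s95_by_block, s95_by_block[1:])]
overall_fold = s95_by_block[0] / s95_by_block[-1]

print([round(r, 2) for r in step_ratios])  # [0.57, 0.54, 0.28, 0.16, 0.79]
print(round(overall_fold, 1))              # 92.3
```

The block-to-block ratios range from about 0.16 to 0.79, so the decay is steep but not at a constant rate; most of the reduction happens in blocks 4 and 5.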

Case Studies

Case 1: Low-Entropy Token "Nice" (S ≈ 2)

Pivot Sentence Examples:

"Nice to meet you"
"That's a Nice idea"

High-Probability Replacement Tokens:

"nice" (P ≈ 0.96)
"lovey" (P ≈ 0.42)

Low-Probability Noise Tokens:

"jug", "broad", "ese" (P ≈ 1/24)

Explanation: Proper nouns or specific vocabulary with few replacement options result in low entropy

Case 2: High-Entropy Token "buy" (S ≈ 14)

Characteristics: Many tokens with P_i > Threshold

"purchase", "get", "acquire", "obtain" and other synonyms
More semantic equivalence replacement options

Explanation: Common verb with rich synonymy results in high entropy

Case 3: Dual-Token Multiplicative Effect

Source Sentence: "You seemed very much in love, your arms full of wine and food"

SG(wine) = 86
SG(food) = 26
SG(wine, food) = 1,132
Ratio: 1,132 / (86 × 26) = 0.51

Explanation: Token replacements show correlation (e.g., "wine and beer" is more common than "wine and bread"), resulting in an actual degeneracy of about half the theoretical product
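
The ratio in this case study is a one-line calculation. The helper name `degeneracy_ratio` is hypothetical; the three subgroup sizes are the ones quoted above.

```python
def degeneracy_ratio(sg_a: int, sg_b: int, sg_ab: int) -> float:
    """Observed dual-token degeneracy divided by the product of single-token degeneracies."""
    return sg_ab / (sg_a * sg_b)


ratio = degeneracy_ratio(sg_a=86, sg_b=26, sg_ab=1_132)
print(round(ratio, 2))  # 0.51
```

A ratio of 1 would indicate fully independent replacements; values below 1 indicate that some pairs of individually valid replacements fail when combined.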

Experimental Findings

Long-Tail Distribution of Entropy Values: Most tokens have S(T_α) in the 1-13 range, but rare outliers reach hundreds (Fig. 4)
Intrinsic Differences Between Language Pairs: English-French asymmetry may stem from language structure differences (e.g., stricter gender-number agreement in French) rather than model defects
Non-Monotonic Relationship Between Model Size and Performance: MarianMT (75M) outperforms NLLB-200 (615M) on certain tasks, indicating architecture design and training data quality matter more than parameter count
Universality of Translation Degeneracy: All translators exhibit significant translation degeneracy (S₉₅ ≥ 2.8 at β_c = 5), reflecting inherent synonymy in natural language
COMET's Discrimination Problem: COMET scores cluster in the narrow range 0.73-0.82, while TE's S₉₅ spans 2.8-295.9, providing greater discrimination

Related Work

1. Theoretical Research on Language Entropy

Shannon (1951): Estimated English entropy at approximately 1 bit/letter through human prediction experiments
Limitations: Cannot extend to N > 10 sequences; requires exponential data quantities

2. Machine Translation Evaluation Metrics

BLEU (Papineni et al., 2002): Based on exact n-gram matching, ignores semantic equivalence
COMET (Rei et al., 2020): Uses neural networks to assess semantic similarity but still depends on reference translations
This Paper's Advantages: No reference translations needed; directly quantifies translator characteristics from information-theoretic perspective

3. Deep Learning Translation Models

Transformer Architecture (Vaswani et al., 2017): Encoder-decoder structure became mainstream
MarianMT (Junczys-Dowmunt et al., 2018): Efficient C++ implementation
T5 (Raffel et al., 2020): Unified text-to-text framework
NLLB-200 (Koishekenov et al., 2022): Large-scale multilingual translation

4. Internal Mechanisms of Translation Systems

This Paper's Contribution: First quantification of layer-wise translation improvement across decoder blocks (Table 7)
Related Research: Gross et al. (2025) and Koresh et al. (2025) on Transformer learning mechanisms

Conclusions and Discussion

Main Conclusions

Translation Entropy is Measurable: Through statistical analysis of token replacements maintaining translation invariance, translator entropy can be quantified
Mutual Translation Entropy May Be Asymmetric: English-French translation shows 2.6-fold asymmetry, while English-Hebrew is approximately symmetric, indicating intrinsic structural differences between language pairs
Dual-Token Multiplicative Law: SG(T_α, T_β) ≈ 0.5-0.9 × SG(T_α) × SG(T_β), revealing semantic correlations between tokens
Non-Linear Relationship Between Model Size and Performance: MarianMT (75M parameters) outperforms NLLB-200 (615M parameters) on certain tasks
Progressive Optimization Across Decoder Layers: Translation entropy decreases exponentially across decoder layers (from 10,712 to 116)

Limitations

1. Methodological Level

Entropy Ambiguity: Different P_i distributions may produce identical entropy values; requires combining |SG| and N_Av for comprehensive interpretation
Sample Size Constraints: Using only 100 pivot tokens and 30 sentences; statistical robustness needs improvement
Computational Complexity: Dual-token analysis tested on only ~100 sentences due to combinatorial explosion

2. Theoretical Level

Unknown Optimal Entropy: Cannot determine language's minimum achievable entropy; only relative comparison possible
Inevitability of Synonymy: Zero entropy unrealistic due to inherent synonymy in natural language
Unclear Asymmetry Source: Cannot distinguish whether asymmetry stems from language structure or model training

3. Experimental Level

Dataset Dependency: Results based on Opus100; other datasets may produce different results
Limited Language Pairs: Only English-French and English-Hebrew tested; broader coverage needed
Threshold Selection: While results are broadly robust across the tested thresholds (β_c = 5 and β_c = 9), the optimal value still needs theoretical guidance

Future Directions

Extension to More Language Pairs: Construct language clusters; distinguish symmetric/asymmetric mutual translation characteristics
Pre-training for High-Entropy Tokens: Develop specialized training strategies for tokens with S(T_α) > 10
Estimation of Theoretical Minimum Entropy: Explore entropy lower bounds for given language pairs
Relationship with Model Architecture: Study effects of encoder/decoder layer count, attention heads, etc. on TE
Online TE Estimation: Develop incremental estimation methods without requiring complete training datasets
Multi-Token Extension: Investigate higher-order correlations for three or more token replacements

In-Depth Evaluation

Strengths

1. Methodological Innovation (★★★★★)

Paradigm Shift: First information-theoretic definition of computable translation entropy, circumventing single-language entropy estimation difficulties
Theoretical Depth: Combines Shannon entropy theory with modern deep learning, bridging statistical physics and NLP
Universality: Applicable to any encoder-decoder architecture, not limited to specific models

2. Experimental Sufficiency (★★★★☆)

Multi-Model Validation: Tests three mainstream translators (MarianMT, T5-Base, NLLB-200)
Multiple Translation Directions: Four directions tested (EN→FR, FR→EN, EN→HE, HE→EN)
Complete Ablation Studies: Threshold robustness, no-threshold comparison, decoder block analysis
Limitation: Relatively limited pivot token count (100) and sentence count (30)

3. Result Convincingness (★★★★☆)

Important Findings:
- Mutual translation asymmetry (2.6-fold English-French difference)
- Dual-token multiplicative effect (coefficient 0.5-0.9)
- Decoder entropy reduction pattern (exponential decrease)
Comparison with Traditional Metrics: TE partially aligns with BLEU/COMET but provides new perspective
Limitation: Not validated on larger-scale datasets (e.g., WMT)

4. Writing Clarity (★★★★★)

Rigorous Structure: From historical background → problem definition → method design → experimental validation; clear logic
Excellent Visualization: Figures 1-6 intuitively present concepts and results
Standard Mathematical Expression: Clear formula derivations and well-defined symbols

Weaknesses

1. Missing Statistical Significance Testing

No confidence intervals or standard deviations provided for S₉₅
Is 100 pivot tokens sufficient? Requires bootstrap validation
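
One way such a bootstrap check could look is sketched below, as a percentile bootstrap over per-token entropies. This is an assumption about how the validation might be done, not something the paper reports; the function names and the synthetic data are hypothetical.

```python
import random
from typing import List, Tuple


def s95(entropies: List[float], fraction: float = 0.95) -> float:
    """Mean of the lowest `fraction` of the entropy values."""
    keep = max(1, round(fraction * len(entropies)))
    lowest = sorted(entropies)[:keep]
    return sum(lowest) / len(lowest)


def bootstrap_ci(entropies: List[float], n_resamples: int = 2000,
                 level: float = 0.95, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for S95, resampling pivot tokens with replacement."""
    rng = random.Random(seed)
    stats = sorted(
        s95(rng.choices(entropies, k=len(entropies))) for _ in range(n_resamples)
    )
    tail = (1 - level) / 2
    return stats[int(tail * n_resamples)], stats[int((1 - tail) * n_resamples) - 1]


# Synthetic per-token entropies (NOT the paper's data): 95 typical values, 5 outliers.
data_rng = random.Random(1)
synthetic = ([data_rng.uniform(1, 13) for _ in range(95)]
             + [data_rng.uniform(100, 400) for _ in range(5)])
low, high = bootstrap_ci(synthetic)
print(f"S95 = {s95(synthetic):.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

Because S₉₅ trims a fixed fraction, resamples that happen to draw more than five outliers pull the upper bound up sharply, which is itself informative about how sensitive the metric is to the choice of pivot tokens.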

2. Insufficient Analysis of COMET/BLEU Contradictions

EN→FR: TE ranking T5-Base > MarianMT, but BLEU/COMET ranking opposite (Table 2 vs Table 8)
Only notes differences without exploring underlying causes (e.g., TE measures degeneracy rather than translation quality?)

3. Missing Computational Cost Analysis

Single token TE estimation requires 30×30,000 = 900,000 translations
100 tokens require 90 million translations total; enormous computational cost
No discussion of complexity reduction strategies
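
The cost estimate above can be reproduced as follows. The throughput figure is a purely hypothetical assumption added to convert the count into wall-clock time; it is not a measurement from the paper.

```python
SENTENCES_PER_TOKEN = 30   # pivot sentences per pivot token
VOCABULARY_SIZE = 30_000   # approximate number of candidate replacement tokens
PIVOT_TOKENS = 100

translations_per_token = SENTENCES_PER_TOKEN * VOCABULARY_SIZE
total_translations = translations_per_token * PIVOT_TOKENS

# Hypothetical throughput, for illustration only.
ASSUMED_TRANSLATIONS_PER_SECOND = 1_000
hours = total_translations / ASSUMED_TRANSLATIONS_PER_SECOND / 3600

print(translations_per_token)  # 900000
print(total_translations)      # 90000000
print(hours)                   # 25.0
```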

4. Insufficient Theoretical Explanation

Why is English-French asymmetric while English-Hebrew symmetric? Only speculates "language structure differences"
What is the theoretical predicted value for dual-token coefficient 0.5-0.9?
What is the optimal distribution form for P_i?

5. Potential Experimental Design Biases

Pivot token frequency selection (500-1,500) may introduce mid-frequency word bias
Can 30 sentences represent all token usages?
Only uses training set sentences; generalization ability untested

Impact

1. Contribution to Field (★★★★☆)

Theoretical Contribution: Establishes operational definition of translation entropy, providing new dimension for translation system evaluation
Methodological Contribution: Token replacement + statistical analysis paradigm extensible to other NLP tasks (text generation, summarization)
Empirical Contribution: Reveals mutual translation asymmetry and decoder optimization mechanisms

2. Practical Value (★★★☆☆)

Advantages:
- No need for manual reference translation annotation
- Provides greater discrimination than COMET
- Usable for model selection and hyperparameter tuning
Limitations:
- High computational cost (90 million translations/100 tokens)
- Requires model internal access (cannot evaluate API translation services)
- Correlation with human evaluation unverified

3. Reproducibility (★★★★☆)

Strengths:
- Detailed method description (algorithm steps, hyperparameters, datasets)
- Uses public datasets (Opus100) and models (MarianMT, etc.)
Weaknesses:
- No code link provided
- Specific selection of 100 pivot tokens not disclosed
- Selection criteria for 30 sentences unclear

Applicable Scenarios

1. Ideal Scenarios

Model Development: Compare translation degeneracy characteristics of different architectures (encoder/decoder layer count, attention mechanisms)
Linguistic Research: Study language pair symmetry; construct language clustering based on TE
Training Optimization: Identify high-entropy tokens; design targeted training strategies

2. Inapplicable Scenarios

Real-Time Evaluation: Computational cost too high for instant evaluation in online translation systems
Black-Box APIs: Requires model internal access; cannot evaluate GPT-4 and similar API services
Low-Resource Languages: Requires sufficient training data for pivot sentence selection

3. Potential Extensions

Text Generation: Evaluate generation diversity of GPT-class models (generation degeneracy)
Summarization Systems: Measure information compression rate from source text to summary
Dialogue Systems: Quantify semantic equivalence class size of responses

Key References

Shannon, C.E. (1951): Prediction and entropy of printed English - Pioneering work on language entropy
Vaswani et al. (2017): Attention is all you need - Transformer architecture
Papineni et al. (2002): BLEU metric - Classical translation evaluation metric
Rei et al. (2020): COMET - Neural translation evaluation framework
Raffel et al. (2020): T5 - Unified text-to-text Transformer

Summary

The translation entropy framework proposed in this paper represents an important innovation in machine translation evaluation, providing a novel information-theoretic perspective. Its core advantages lie in requiring no reference translations and providing greater discrimination, while core findings (mutual translation asymmetry, dual-token multiplicative effect, decoder entropy reduction) hold significant theoretical and practical importance. However, high computational cost, insufficient theoretical explanation, and unexplored contradictions with traditional metrics are main limitations. Future work reducing computational complexity, extending to more language pairs, and deeply analyzing asymmetry sources could establish this method as a standard tool for translation system evaluation.

Recommendation Index: ★★★★☆ (4/5)
Suitable Readers: Machine translation researchers, scholars in information theory and NLP intersection, translation system developers