Translation Entropy: A Statistical Framework for Evaluating Translation Systems
Gross, Harel, Kanter
The translation of written language has been practiced since the 3rd century BC; however, the need for it has grown sharply in the information age. Today, many translators exist, based on encoder-decoder deep architectures; nevertheless, no quantitative objective methods are available to assess their performance, likely because the entropy of even a single language remains unknown. This study presents a quantitative method for estimating translation entropy, built on the following key finding. Given a translator, several sentences that differ from a given pivot sentence by only one selected token yield identical translations. Analyzing the statistics of this phenomenon across an ensemble of such sentences, each containing the selected pivot token, yields the probabilities of replacing this specific token with others while preserving the translation. These probabilities constitute the entropy of the selected token, and the average across all selected pivot tokens provides an estimate of the translator's overall translation entropy, which is enhanced along the decoder blocks. This entropic measure allows for the quantitative ranking of several publicly available translators and reveals whether mutual translation entropy is symmetric. Extending the proposed method to include the replacement of two tokens in a given pivot sentence demonstrates a multiplicative effect, where translation degeneracy is proportional to the product of the degeneracies of the two tokens. These findings establish translation entropy as a measurable property and an objective benchmark for artificial translators. Results are based on the MarianMT, T5-Base and NLLB-200 translators.
This research addresses the lack of objective quantitative evaluation methods for machine translation systems by proposing a statistical framework for estimating Translation Entropy (TE). The core finding is that given a translator, multiple source sentences differing in only one selected token may produce identical translations. By analyzing the statistical properties of this phenomenon, one can compute the probability distribution of replacing specific tokens while maintaining translation invariance, thereby obtaining the entropy value for that token. Averaging entropy values across all selected tokens yields an estimate of the translator's overall translation entropy. The method enables ranking of multiple public translators, reveals symmetry properties of mutual translation entropy, and discovers multiplicative effects in dual-token replacement. The research is validated using three translation models: MarianMT, T5-Base, and NLLB-200.
Machine translation systems (particularly deep learning-based encoder-decoder architectures) lack objective quantitative evaluation methods. Although metrics such as BLEU and COMET exist, they primarily rely on lexical and semantic similarity to reference translations, making it difficult to measure the intrinsic properties of translators from an information-theoretic perspective.
Theoretical Level: The entropy value of a single language cannot be precisely calculated to date. Shannon estimated English entropy at approximately 1 bit per letter in 1951, but extending this to longer text sequences is computationally infeasible.
Practical Level: With increasing translation demands in the information age, objective methods are needed to evaluate and compare the performance of different translation systems.
Scientific Significance: Understanding information degradation phenomena in translation processes and revealing intrinsic relationships between languages.
Propose a method to estimate translation entropy without needing to know individual language entropies, quantifying the "translation degeneracy" phenomenon from an information-theoretic perspective.
Proposes a computable definition of Translation Entropy (TE): Quantifies translation entropy through probability distributions of token replacements that maintain translation invariance.
Develops a systematic TE estimation method: Includes complete workflow of pivot sentence selection, token replacement, subgroup statistics, and entropy calculation.
Discovers multiplicative effects of translation degeneracy: Dual-token replacement degeneracy is approximately 0.5-0.9 times the product of single-token degeneracies.
Reveals asymmetry in mutual translation entropy: English-French translation shows significant asymmetry (French→English entropy approximately 2.5 times English→French), while English-Hebrew translation is approximately symmetric.
Quantitatively ranks three mainstream translators: MarianMT, T5-Base, and NLLB-200, discovering non-monotonic relationships between model size and performance.
Verifies entropy-decreasing patterns across decoder blocks: Translation quality progressively improves along decoder layers (entropy decreases from 10,712 to 116).
Translation entropy estimation consists of the following steps:
Step 1: Single-Token Analysis
Select a pivot token T₁
Select 30 source sentences containing T₁ from the training dataset (at position j)
For each sentence, replace T₁ at position j with all possible tokens (~30,000)
Identify which replacements produce translations identical to the original pivot sentence
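The scan in Step 1 can be sketched as follows. This is a minimal illustration, not the authors' code: `invariant_replacements` and `toy_translate` are hypothetical names, the word-by-word lexicon stands in for a real translator such as MarianMT, and excluding the pivot token itself from the result is an assumption the summary does not confirm.

```python
from typing import Callable, List


def invariant_replacements(
    tokens: List[str],
    position: int,
    vocabulary: List[str],
    translate: Callable[[List[str]], str],
) -> List[str]:
    """Return the vocabulary tokens that, placed at `position`,
    leave the translation of the pivot sentence unchanged."""
    reference = translate(tokens)
    pivot = tokens[position]
    preserved = []
    for candidate in vocabulary:
        if candidate == pivot:
            continue  # assumption: the pivot itself is not counted
        variant = tokens[:position] + [candidate] + tokens[position + 1:]
        if translate(variant) == reference:
            preserved.append(candidate)
    return preserved


# Toy stand-in for a translator (hypothetical): a word-by-word lookup in
# which "big" and "large" collapse onto the same target word.
TOY_LEXICON = {
    "the": "le",
    "big": "grand",
    "large": "grand",
    "small": "petit",
    "dog": "chien",
}


def toy_translate(tokens: List[str]) -> str:
    return " ".join(TOY_LEXICON.get(token, token) for token in tokens)


sentence = ["the", "big", "dog"]
subgroup = invariant_replacements(sentence, 1, list(TOY_LEXICON), toy_translate)
print(subgroup)  # ['large']
```

With a real model, each call to `translate` is a full forward pass over roughly 30,000 candidate tokens per pivot sentence, which is where the computational cost of the method comes from.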
Step 2: Subgroup Construction
For each pivot sentence m, construct subgroup SG_m(T₁) containing all replacement tokens that preserve translation invariance
To avoid anomalously large subgroups (e.g., when models ignore certain tokens, nearly all tokens become replaceable), retain only the 24 smallest subgroups, denoted SG₂₄(T₁)
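A minimal sketch of the subgroup selection in Step 2, assuming each subgroup is represented as a set of token strings. The function name `smallest_subgroups` and the synthetic subgroup sizes are illustrative only.

```python
from typing import List, Set


def smallest_subgroups(subgroups: List[Set[str]], keep: int = 24) -> List[Set[str]]:
    """Keep the `keep` smallest subgroups, discarding the anomalously large ones."""
    return sorted(subgroups, key=len)[:keep]


# Synthetic data: 27 ordinary subgroups plus 3 anomalously large ones,
# mimicking pivot sentences in which the model ignores the pivot token.
sizes = list(range(1, 28)) + [5000, 8000, 9000]
subgroups = [{f"tok{k}" for k in range(size)} for size in sizes]

kept = smallest_subgroups(subgroups)
print(len(subgroups), len(kept), max(len(sg) for sg in kept))  # 30 24 24
```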
Step 3: Probability Calculation
Count occurrences of each token i in SG₂₄(T₁) (1-24 occurrences), divided by 24 to obtain probability P_i:
P_i = (number of occurrences of token i in 24 subgroups) / 24
Step 4: Entropy Calculation
For single-token entropy:
$$ S(T_1) = -\sum_i P_i \log_2 P_i \qquad \text{(Eq. 2)} $$
Average number of replacements:
$$ N_{Av}(T_1) = 24 \sum_i P_i \qquad \text{(Eq. 1)} $$
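Steps 3 and 4 can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, and the example uses 4 toy subgroups instead of 24 so the arithmetic can be checked by hand. It covers only P_i and the entropy of Eq. 2.

```python
import math
from collections import Counter
from typing import Dict, List, Set


def replacement_probabilities(subgroups: List[Set[str]]) -> Dict[str, float]:
    """P_i = (number of subgroups containing token i) / (number of subgroups)."""
    counts = Counter(token for subgroup in subgroups for token in subgroup)
    return {token: count / len(subgroups) for token, count in counts.items()}


def token_entropy(probabilities: Dict[str, float]) -> float:
    """S = -sum_i P_i * log2(P_i), following Eq. 2 as written."""
    return -sum(p * math.log2(p) for p in probabilities.values() if p > 0)


# Toy example with 4 subgroups: "a" appears in all four, "b" in two, "c" in one.
toy_subgroups = [{"a", "b"}, {"a"}, {"a", "c"}, {"a", "b"}]
probabilities = replacement_probabilities(toy_subgroups)
entropy = token_entropy(probabilities)
print(probabilities["a"], probabilities["b"], probabilities["c"])  # 1.0 0.5 0.25
print(entropy)  # 1.0
```

Note that the P_i are per-token frequencies across subgroups and are not normalized to sum to one, so a token that appears in every subgroup (P_i = 1) contributes nothing to S.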
Step 5: Threshold Filtering
To exclude meaningless low-probability replacements (gibberish tokens), apply threshold:
$$ P_i > \text{Threshold} = \frac{\beta_c}{24} \qquad \text{(Eq. 4)} $$
The study uses β_c = 5 (i.e., P_i > 0.208)
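The threshold filter can be sketched in a few lines. The function name `apply_threshold` and the token strings below are made up for illustration; only β_c = 5 and the divisor 24 come from the text.

```python
from typing import Dict


def apply_threshold(
    probabilities: Dict[str, float], beta_c: int = 5, n_subgroups: int = 24
) -> Dict[str, float]:
    """Keep only tokens with P_i strictly greater than beta_c / n_subgroups."""
    threshold = beta_c / n_subgroups
    return {token: p for token, p in probabilities.items() if p > threshold}


# Hypothetical replacement frequencies for one pivot token.
probabilities = {"beer": 20 / 24, "juice": 6 / 24, "xq##": 5 / 24, "zz": 1 / 24}
kept = apply_threshold(probabilities)
print(sorted(kept))  # ['beer', 'juice']
```

Because the inequality is strict, a token that appears in exactly 5 of the 24 subgroups is dropped; a token must appear in at least 6 subgroups to survive.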
Step 6: Overall Entropy Estimation
Repeat the above process for 100 randomly selected pivot tokens and calculate average entropy:
$$ S = \langle S(T_\alpha) \rangle_\alpha \qquad \text{(Eq. 5)} $$
To reduce the impact of outliers, use S₉₅ (average of only the 95 lowest entropy values)
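A minimal sketch of the S₉₅ estimate, assuming the 100 per-token entropies are already available as a list. The function name `trimmed_mean_entropy` and the synthetic values are illustrative, chosen to show how a few outliers dominate the plain average.

```python
from typing import List


def trimmed_mean_entropy(entropies: List[float], keep: int = 95) -> float:
    """Average of the `keep` lowest entropy values (S_95 when keep=95)."""
    lowest = sorted(entropies)[:keep]
    return sum(lowest) / len(lowest)


# Synthetic data: 95 typical pivot tokens and 5 outliers.
entropies = [2.0] * 95 + [300.0] * 5

plain_mean = sum(entropies) / len(entropies)
s95 = trimmed_mean_entropy(entropies)
print(plain_mean, s95)  # 16.9 2.0
```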
Unlike traditional analyses of token replacement within a single sentence, this method measures which tokens consistently preserve the translation across multiple sentences containing the pivot token, which is a stronger constraint.
Source Sentence: "You seemed very much in love, your arms full of wine and food"
SG(wine) = 86
SG(food) = 26
SG(wine, food) = 1,132
Ratio: 1,132 / (86 × 26) = 0.51
Explanation: Token replacements are correlated (e.g., "wine and beer" is more common than "wine and bread"), so the actual degeneracy falls below the theoretical product, here to roughly half of it.
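The ratio in this example can be reproduced directly from the three subgroup sizes quoted above. The helper name `degeneracy_ratio` is hypothetical.

```python
def degeneracy_ratio(pair_size: int, size_a: int, size_b: int) -> float:
    """Dual-token degeneracy divided by the product of single-token degeneracies."""
    return pair_size / (size_a * size_b)


product = 86 * 26
ratio = degeneracy_ratio(1132, 86, 26)
print(product, round(ratio, 2))  # 2236 0.51
```

A ratio of 1 would mean the two replacements are fully independent; values below 1 indicate that some combinations of individually valid replacements change the translation.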
Long-Tail Distribution of Entropy Values: Most tokens have S(T_α) in the 1-13 range, but rare outliers reach hundreds (Fig. 4)
Intrinsic Differences Between Language Pairs: English-French asymmetry may stem from language structure differences (e.g., stricter gender-number agreement in French) rather than model defects
Non-Monotonic Relationship Between Model Size and Performance: MarianMT (75M) outperforms NLLB-200 (615M) on certain tasks, suggesting that architecture design and training data quality can matter more than parameter count
Universality of Translation Degeneracy: All translators exhibit significant translation degeneracy (S₉₅ > 2.8), reflecting inherent synonymy in natural language
COMET's Discrimination Problem: COMET scores cluster in the narrow range 0.72-0.82, while TE's S₉₅ spans 2.8-295.9, providing greater discrimination between translators
Translation Entropy is Measurable: Through statistical analysis of token replacements maintaining translation invariance, translator entropy can be quantified
Mutual Translation Entropy May Be Asymmetric: English-French translation shows 2.6-fold asymmetry, while English-Hebrew is approximately symmetric, indicating intrinsic structural differences between language pairs
The translation entropy framework proposed in this paper is an important innovation in machine translation evaluation, providing a novel information-theoretic perspective. Its core advantages are that it requires no reference translations and discriminates more sharply between translators, while its core findings (mutual translation asymmetry, the dual-token multiplicative effect, decoder entropy reduction) carry significant theoretical and practical importance. However, high computational cost, insufficient theoretical explanation, and unexplored contradictions with traditional metrics are the main limitations. Future work that reduces computational complexity, extends the method to more language pairs, and analyzes the sources of asymmetry in depth could establish this method as a standard tool for translation system evaluation.
Recommendation Index: ★★★★☆ (4/5)
Suitable Readers: Machine translation researchers, scholars working at the intersection of information theory and NLP, translation system developers