Translation Entropy: A Statistical Framework for Evaluating Translation Systems

Gross, Harel, Kanter
The translation of written language has been known since the 3rd century BC; however, its necessity has become increasingly common in the information age. Today, many translators exist, based on encoder-decoder deep architectures, nevertheless, no quantitative objective methods are available to assess their performance, likely because the entropy of even a single language remains unknown. This study presents a quantitative method for estimating translation entropy, with the following key finding. Given a translator, several sentences that differ by only one selected token of a given pivot sentence yield identical translations. Analyzing the statistics of this phenomenon across an ensemble of such sentences, consisting each of a pivot selected token, yields the probabilities of replacing this specific token with others while preserving the translation. These probabilities constitute the entropy of the selected token, and the average across all selected pivot tokens provides an estimate of the translator's overall translation entropy, which is enhanced along the decoder blocks. This entropic measure allows for the quantitative ranking of several publicly available translators and reveals whether mutual translation entropy is symmetric. Extending the proposed method to include the replacement of two tokens in a given pivot sentence demonstrates a multiplicative effect, where translation degeneracy is proportional to the product of the degeneracies of the two tokens. These findings establish translation entropy as a measurable property and objective benchmarking of artificial translators. Results are based on MarianMT, T5-Base and NLLB-200 translators.

Basic Information

  • Paper ID: 2511.13180
  • Title: Translation Entropy: A Statistical Framework for Evaluating Translation Systems
  • Authors: Ronit D. Gross, Yanir Harel, Ido Kanter (Bar-Ilan University)
  • Classification: cs.CL (Computational Linguistics)
  • Publication Date: 2025
  • Paper Link: https://arxiv.org/abs/2511.13180

Abstract

This research addresses the lack of objective quantitative evaluation methods for machine translation systems by proposing a statistical framework for estimating Translation Entropy (TE). The core finding is that given a translator, multiple source sentences differing in only one selected token may produce identical translations. By analyzing the statistical properties of this phenomenon, one can compute the probability distribution of replacing specific tokens while maintaining translation invariance, thereby obtaining the entropy value for that token. Averaging entropy values across all selected tokens yields an estimate of the translator's overall translation entropy. The method enables ranking of multiple public translators, reveals symmetry properties of mutual translation entropy, and discovers multiplicative effects in dual-token replacement. The research is validated using three translation models: MarianMT, T5-Base, and NLLB-200.

Research Background and Motivation

1. Core Problem to Address

Machine translation systems (particularly deep learning-based encoder-decoder architectures) lack objective quantitative evaluation methods. Although metrics such as BLEU and COMET exist, they primarily rely on lexical and semantic similarity to reference translations, making it difficult to measure the intrinsic properties of translators from an information-theoretic perspective.

2. Problem Significance

  • Theoretical Level: The entropy value of a single language cannot be precisely calculated to date. Shannon estimated English entropy at approximately 1 bit per letter in 1951, but extending this to longer text sequences is computationally infeasible.
  • Practical Level: With increasing translation demands in the information age, objective methods are needed to evaluate and compare the performance of different translation systems.
  • Scientific Significance: Understanding information degradation phenomena in translation processes and revealing intrinsic relationships between languages.

3. Limitations of Existing Methods

  • BLEU: Based on n-gram matching, unable to recognize translations with different wording but identical meaning.
  • COMET: Although using neural models to understand semantics, it still depends on reference translations and shows small score variance (see Table 8).
  • Theoretical Dilemma: Theoretical estimation of language entropy remains unsolved; translation entropy is even more complex.

4. Research Motivation

Propose a method to estimate translation entropy without needing to know individual language entropies, quantifying the "translation degeneracy" phenomenon from an information-theoretic perspective.

Core Contributions

  1. Proposes a computable definition of Translation Entropy (TE): Quantifies translation entropy through probability distributions of token replacements that maintain translation invariance.
  2. Develops a systematic TE estimation method: Includes complete workflow of pivot sentence selection, token replacement, subgroup statistics, and entropy calculation.
  3. Discovers multiplicative effects of translation degeneracy: Dual-token replacement degeneracy is approximately 0.5-0.9 times the product of single-token degeneracies.
  4. Reveals asymmetry in mutual translation entropy: English-French translation shows significant asymmetry (French→English S₉₅ is approximately 2.6 times English→French), while English-Hebrew translation is approximately symmetric.
  5. Quantitatively ranks three mainstream translators: MarianMT, T5-Base, and NLLB-200, discovering non-monotonic relationships between model size and performance.
  6. Verifies entropy-decreasing patterns across decoder blocks: Translation quality progressively improves along decoder layers (entropy decreases from 10,712 to 116).

Methodology Details

Task Definition

Input: Encoder-decoder translation model, source language dataset
Output: Translation entropy value S (or S₉₅) quantifying the translator's translation degeneracy
Constraints: Requires sufficient source sentences containing selected tokens (this study uses 30 pivot sentences)

Model Architecture

Overall Workflow

Translation entropy estimation consists of the following steps:

Step 1: Single-Token Analysis

  1. Select a pivot token T₁
  2. Select 30 source sentences from the training dataset that contain T₁ (at some position j)
  3. For each sentence, replace T₁ at position j with all possible tokens (~30,000)
  4. Identify which replacements produce translations identical to the original pivot sentence
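
Step 1 can be sketched as follows. This is a minimal illustration, not the authors' code: `build_subgroup`, `toy_translate`, and the tiny lexicon are hypothetical names, and the sketch assumes the pivot token itself is not counted as a replacement. A real run would pass a function wrapping an actual translator and the full vocabulary of roughly 30,000 tokens.

```python
from typing import Callable, Iterable, List, Set


def build_subgroup(
    pivot_tokens: List[str],
    position: int,
    vocabulary: Iterable[str],
    translate: Callable[[List[str]], str],
) -> Set[str]:
    """Return every replacement token that leaves the translation unchanged."""
    reference = translate(pivot_tokens)
    original = pivot_tokens[position]
    subgroup = set()
    for candidate in vocabulary:
        if candidate == original:
            continue  # assumption: the pivot token itself is not a replacement
        variant = list(pivot_tokens)
        variant[position] = candidate
        if translate(variant) == reference:
            subgroup.add(candidate)
    return subgroup


# Hypothetical stand-in for a real translator: two source tokens map to the
# same target word, every other token is passed through in lowercase.
TOY_LEXICON = {"nice": "sympa", "lovely": "sympa"}


def toy_translate(tokens: List[str]) -> str:
    return " ".join(TOY_LEXICON.get(t.lower(), t.lower()) for t in tokens)


pivot = ["That's", "a", "Nice", "idea"]
vocab = ["Nice", "nice", "lovely", "jug", "broad"]
subgroup = build_subgroup(pivot, 2, vocab, toy_translate)
print(sorted(subgroup))  # ['lovely', 'nice']
```

Passing the translator in as a callable keeps the loop independent of any particular model library.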

Step 2: Subgroup Construction

  • For each pivot sentence m, construct subgroup SG_m(T₁) containing all replacement tokens that preserve translation invariance
  • To avoid anomalously large subgroups (e.g., when models ignore certain tokens, nearly all tokens become replaceable), retain only the 24 smallest subgroups, denoted SG₂₄(T₁)

Step 3: Probability Calculation

Count the occurrences of each token i across the 24 subgroups in SG₂₄(T₁) (between 1 and 24), then divide by 24 to obtain the probability P_i:

P_i = (number of occurrences of token i in 24 subgroups) / 24
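
A rough sketch of Steps 2 and 3 together, under the assumption that each subgroup is available as a Python set of replacement tokens. The function name `replacement_probabilities` and the toy subgroups are illustrative only.

```python
from collections import Counter
from typing import Dict, List, Set


def replacement_probabilities(
    subgroups: List[Set[str]], keep: int = 24
) -> Dict[str, float]:
    """Keep the `keep` smallest subgroups and return P_i for every token in them."""
    kept = sorted(subgroups, key=len)[:keep]
    counts = Counter(token for sg in kept for token in sg)
    # Divide by the number of retained subgroups (24 in the study).
    return {token: n / len(kept) for token, n in counts.items()}


# Toy data: 30 subgroups, 6 of them anomalously large (token ignored by the model).
huge = {f"tok{i}" for i in range(1000)}
subgroups = [{"nice"}] * 12 + [{"nice", "lovely"}] * 12 + [huge] * 6
probs = replacement_probabilities(subgroups)
print(probs)  # {'nice': 1.0, 'lovely': 0.5}
```

Sorting by size before truncating is what drops the six oversized subgroups in this toy case.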

Step 4: Entropy Calculation

For single-token entropy:

$$S(T_1) = -\sum_i P_i \log_2 P_i \quad \text{(Eq. 2)}$$

Average number of replacements:

$$N_{Av}(T_1) = 24 \sum_i P_i \quad \text{(Eq. 1)}$$

Step 5: Threshold Filtering

To exclude meaningless low-probability replacements (gibberish tokens), apply a threshold:

$$P_i > \text{Threshold} = \frac{\beta_c}{24} \quad \text{(Eq. 4)}$$

The study uses β_c = 5 (i.e., P_i > 0.208).
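
The entropy of Eq. 2 combined with the threshold of Eq. 4 might be computed as below. The sketch assumes that the sum in Eq. 2 runs only over tokens whose P_i exceeds the threshold; the function name `token_entropy` and the example probabilities are hypothetical.

```python
import math
from typing import Dict


def token_entropy(
    probs: Dict[str, float], beta_c: int = 5, n_subgroups: int = 24
) -> float:
    """Shannon entropy in bits over tokens with P_i > beta_c / n_subgroups."""
    threshold = beta_c / n_subgroups
    return -sum(p * math.log2(p) for p in probs.values() if p > threshold)


# Illustrative probabilities: a perfect synonym, a partial one, and a noise token.
example = {"nice": 1.0, "lovely": 0.5, "jug": 1 / 24}
print(token_entropy(example))            # 0.5 (only "lovely" contributes)
print(token_entropy(example, beta_c=0))  # larger: the noise token now counts
```

A token with P_i = 1 contributes nothing, so the thresholded entropy here comes entirely from the partial synonym.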

Step 6: Overall Entropy Estimation

Repeat the above process for 100 randomly selected pivot tokens and calculate the average entropy:

$$S = \langle S(T_\alpha) \rangle_\alpha \quad \text{(Eq. 5)}$$

To reduce the impact of outliers, use S₉₅ (average of only the 95 lowest entropy values)
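
The difference between S and S₉₅ can be illustrated with a short sketch. The helper name `overall_entropy` and the entropy values are made up for the example; they are not results from the paper.

```python
from statistics import mean
from typing import Optional, Sequence


def overall_entropy(
    token_entropies: Sequence[float], keep_lowest: Optional[int] = None
) -> float:
    """Average per-token entropy, optionally over the lowest `keep_lowest` values."""
    values = sorted(token_entropies)
    if keep_lowest is not None:
        values = values[:keep_lowest]
    return mean(values)


# Synthetic example: 95 well-behaved tokens and 5 extreme outliers.
entropies = [2.0] * 95 + [500.0] * 5
S = overall_entropy(entropies)
S95 = overall_entropy(entropies, keep_lowest=95)
print(S, S95)  # 26.9 2.0
```

A handful of outliers can dominate S while leaving S₉₅ untouched, which is consistent with the large gaps between S and S₉₅ reported for some models.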

Technical Innovations

1. Conditional Degeneracy Measurement

Rather than asking which tokens can replace a token within one specific sentence, this method asks which tokens consistently preserve the translation across multiple sentences containing that token, which is a stronger conditional constraint.

2. Rationality of Threshold Design

By analyzing the distribution characteristics of P_i:

  • P_i = 1: Strong synonyms, entropy contribution is 0
  • P_i ≈ 0.37 (1/e): Maximum entropy contribution
  • P_i ≪ 0.37: Noise tokens, require filtering

Threshold β_c = 5 corresponding to P_i ≈ 0.208 balances retaining meaningful replacements and filtering noise.
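
The 1/e value quoted above follows from maximizing a single term of Eq. 2. The short derivation below is a standard calculus step, added here for illustration; the numerical values at the threshold and at a single occurrence are computed directly from the same expression.

```latex
\frac{d}{dP_i}\bigl(-P_i \log_2 P_i\bigr) = -\frac{\ln P_i + 1}{\ln 2} = 0
\;\Longrightarrow\; P_i = \frac{1}{e} \approx 0.368,
\qquad
-\frac{1}{e}\log_2\frac{1}{e} = \frac{\log_2 e}{e} \approx 0.531 \text{ bits}

-\frac{5}{24}\log_2\frac{5}{24} \approx 0.471 \text{ bits},
\qquad
-\frac{1}{24}\log_2\frac{1}{24} \approx 0.191 \text{ bits}
```

A token that appears in only one of the 24 subgroups still contributes about 0.19 bits, so a large number of such tokens can add up to a sizable entropy if no threshold is applied.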

3. Dual-Token Multiplicative Effect

Translation degeneracy satisfies an approximate multiplicative relationship:

$$SG(T_\alpha, T_\beta) > 0.5 \cdot SG(T_\alpha) \cdot SG(T_\beta) \quad \text{(Eq. 6)}$$

Coefficients of 0.5-0.9 indicate semantic correlations between tokens, suggesting translation does not process each token completely independently.

4. Distinction from Baselines

  • vs BLEU: Does not depend on reference translations; measures model's intrinsic information degradation.
  • vs COMET: Quantifies from information-theoretic perspective rather than semantic similarity.
  • vs Language Entropy Estimation: Circumvents computational difficulties of single-language entropy, directly measuring entropy of translation mapping.

Experimental Setup

Datasets

  • MarianMT Training Data: Opus100 dataset containing approximately 1 million training sentences and 2,000 validation sentences
  • Language Pairs: English-French (approximately 30,000 tokens each), English-Hebrew
  • Pivot Sentence Selection:
    • 30 source sentences containing each pivot token
    • Token frequency range: 500-1,500 occurrences (excluding overly frequent conjunctions and rare words)
    • Sentence length: Maximum 128 tokens

Evaluation Metrics

  1. S: Average entropy of 100 pivot tokens
  2. S₉₅: Average of 95 lowest entropy values (primary metric, excluding outliers)
  3. N_Av: Average number of replacements
  4. |SG|: Subgroup size

Comparison Methods

  • Translation Models:
    • MarianMT (Helsinki-NLP/opus-mt): 6 encoder + 6 decoder blocks, ~75M parameters
    • T5-Base (Google): 12 encoder + 12 decoder blocks, ~223M parameters
    • NLLB-200 (Facebook): 12 encoder + 12 decoder blocks, ~615M parameters
  • Traditional Metrics: BLEU and COMET scores

Implementation Details

  • Number of Pivot Tokens: 100 randomly selected
  • Sentences per Token: 30
  • Number of Subgroups: Retain 24 smallest subgroups
  • Threshold: β_c = 5 (main results), β_c = 9 (robustness verification)
  • Decoder Block Analysis: Freeze first m blocks, train fully connected layer (50 epochs, CosineAnnealingLR, learning rate 1e-4)

Experimental Results

Main Results

1. English-French Mutual Translation Asymmetry (MarianMT)

| Direction | S | S₉₅ |
|---|---|---|
| EN→FR | 29.5 | 3.6 |
| FR→EN | 20.7 | 9.5 |

Finding: FR→EN S₉₅ is 2.6 times EN→FR, showing significant asymmetry

2. English-Hebrew Mutual Translation Symmetry (MarianMT)

| Direction | S | S₉₅ |
|---|---|---|
| EN→HE | 8.0 | 5.7 |
| HE→EN | 17.5 | 6.3 |

Finding: S₉₅ values are close (5.7 vs 6.3), showing approximate symmetry

3. Ranking of Three Translators (EN→FR)

| Model | S | S₉₅ | Parameters |
|---|---|---|---|
| MarianMT | 29.5 | 3.6 | ~75M |
| NLLB-200 | 73.5 | 13.0 | ~615M |
| T5-Base | 90.9 | 2.8 | ~223M |

Finding: T5-Base performs best on S₉₅, followed by MarianMT; NLLB-200 with the most parameters performs worst

4. Ranking of Three Translators (FR→EN)

| Model | S | S₉₅ |
|---|---|---|
| MarianMT | 20.7 | 9.5 |
| NLLB-200 | 251.2 | 108.9 |
| T5-Base | 394.0 | 295.9 |

Finding: MarianMT significantly outperforms the other two models

5. Comparison with Traditional Metrics

| Model | EN→FR BLEU | EN→FR COMET | FR→EN BLEU | FR→EN COMET |
|---|---|---|---|---|
| MarianMT | 38.83 | 0.8026 | 39.82 | 0.8223 |
| NLLB-200 | 33.27 | 0.798 | 34.38 | 0.8037 |
| T5-Base | 37.08 | 0.7763 | 28.19 | 0.7299 |

Observations:

  • MarianMT leads across all BLEU and COMET metrics
  • TE ranking partially aligns with COMET/BLEU (FR→EN) but differs for EN→FR
  • COMET scores span a narrow range (0.73-0.82), providing less discrimination than TE

Ablation Studies

1. Threshold Robustness Verification

S₉₅ values using β_c = 9:

  • EN→FR: MarianMT (1.5), NLLB-200 (2.8), T5-Base (1.1)
  • FR→EN: MarianMT (2.8), NLLB-200 (6.5), T5-Base (3.9)

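Conclusion: For EN→FR the ranking order is unchanged (T5-Base lowest, NLLB-200 highest). For FR→EN, MarianMT remains lowest, but NLLB-200 and T5-Base swap places relative to β_c = 5, so the ranking is only partly robust to threshold selection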

2. Translation Noise Analysis Without Threshold (β_c = 0)

| Direction | MarianMT | NLLB-200 | T5-Base |
|---|---|---|---|
| EN→FR S₉₅ | 116.1 | 1,374.3 | 258.6 |
| FR→EN S₉₅ | 379.9 | 2,840.6 | 1,176.9 |

Finding:

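  • Entropy values increase substantially relative to β_c = 5, from roughly 4-fold (T5-Base, FR→EN) to roughly 100-fold (NLLB-200, EN→FR)
  • Ranking trends partly align with threshold-based cases: without a threshold MarianMT is lowest and NLLB-200 highest in both directions, whereas with β_c = 5 T5-Base is lowest for EN→FR and highest for FR→EN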
  • Validates existence of translation noise and necessity of threshold filtering

3. Entropy Reduction Across Decoder Blocks

| Decoder Block | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| S₉₅ | 10,712 | 6,114 | 3,295 | 908 | 147 | 116 |

Conclusion: Translation quality progressively improves across decoder layers with exponential entropy decrease
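
One way to probe the "exponential" description is a least-squares line through ln(S₉₅) against the block index, using the six values from the table above. This is an illustrative check, not an analysis taken from the paper.

```python
import math

# S95 after decoder blocks 1..6, as listed in the table above.
s95_by_block = [10712, 6114, 3295, 908, 147, 116]
blocks = list(range(1, len(s95_by_block) + 1))
log_s = [math.log(v) for v in s95_by_block]

# Ordinary least-squares slope of ln(S95) versus block index.
mean_x = sum(blocks) / len(blocks)
mean_y = sum(log_s) / len(log_s)
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(blocks, log_s)) / sum(
    (x - mean_x) ** 2 for x in blocks
)
per_block_factor = math.exp(slope)

print(f"slope of ln(S95) per block: {slope:.2f}")
print(f"average multiplicative change per block: {per_block_factor:.2f}")
```

The fitted slope comes out close to -1, meaning S₉₅ shrinks by a factor of roughly e per block on average. The drop is not uniform, though: it is steepest between blocks 3 and 5 and nearly flat between blocks 5 and 6.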

Case Studies

Case 1: Low-Entropy Token "Nice" (S ≈ 2)

Pivot Sentence Examples:

  • "Nice to meet you"
  • "That's a Nice idea"

High-Probability Replacement Tokens:

  • "nice" (P ≈ 0.96)
  • "lovey" (P ≈ 0.42)

Low-Probability Noise Tokens:

  • "jug", "broad", "ese" (P ≈ 1/24)

Explanation: Proper nouns or specific vocabulary with few replacement options result in low entropy

Case 2: High-Entropy Token "buy" (S ≈ 14)

Characteristics: Many tokens with P_i > Threshold

  • "purchase", "get", "acquire", "obtain" and other synonyms
  • More semantic equivalence replacement options

Explanation: Common verb with rich synonymy results in high entropy

Case 3: Dual-Token Multiplicative Effect

Source Sentence: "You seemed very much in love, your arms full of wine and food"

  • SG(wine) = 86
  • SG(food) = 26
  • SG(wine, food) = 1,132
  • Ratio: 1,132 / (86 × 26) = 0.51

Explanation: Token replacements are correlated (e.g., "wine and beer" is more common than "wine and bread"), so the actual degeneracy is about half of the product expected if the two tokens were replaced independently
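
The ratio in this case study can be reproduced in a couple of lines. `degeneracy_ratio` is a hypothetical helper name; the three subgroup sizes are the ones quoted above.

```python
def degeneracy_ratio(sg_pair: int, sg_a: int, sg_b: int) -> float:
    """Joint degeneracy divided by the product of single-token degeneracies."""
    return sg_pair / (sg_a * sg_b)


ratio = degeneracy_ratio(sg_pair=1132, sg_a=86, sg_b=26)
print(round(ratio, 2))  # 0.51
print(ratio > 0.5)      # True: consistent with the bound in Eq. 6
```

A ratio of 1 would correspond to fully independent replacements, so the distance from 1 gives a rough measure of how strongly the two tokens are coupled.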

Experimental Findings

  1. Long-Tail Distribution of Entropy Values: Most tokens have S(T_α) in the 1-13 range, but rare outliers reach hundreds (Fig. 4)
  2. Intrinsic Differences Between Language Pairs: English-French asymmetry may stem from language structure differences (e.g., stricter gender-number agreement in French) rather than model defects
  3. Non-Monotonic Relationship Between Model Size and Performance: MarianMT (75M) outperforms NLLB-200 (615M) on certain tasks, indicating architecture design and training data quality matter more than parameter count
  4. Universality of Translation Degeneracy: All translators exhibit significant translation degeneracy (S₉₅ ≥ 2.8 at β_c = 5), reflecting inherent synonymy in natural language
  5. COMET's Discrimination Problem: COMET scores cluster in the narrow range 0.73-0.82, while TE's S₉₅ spans 2.8-295.9, providing greater discrimination

Related Work

1. Theoretical Research on Language Entropy

  • Shannon (1951): Estimated English entropy at approximately 1 bit/letter through human prediction experiments
  • Limitations: Cannot extend to N > 10 sequences; requires exponential data quantities

2. Machine Translation Evaluation Metrics

  • BLEU (Papineni et al., 2002): Based on exact n-gram matching, ignores semantic equivalence
  • COMET (Rei et al., 2020): Uses neural networks to assess semantic similarity but still depends on reference translations
  • This Paper's Advantages: No reference translations needed; directly quantifies translator characteristics from information-theoretic perspective

3. Deep Learning Translation Models

  • Transformer Architecture (Vaswani et al., 2017): Encoder-decoder structure became mainstream
  • MarianMT (Junczys-Dowmunt et al., 2018): Efficient C++ implementation
  • T5 (Raffel et al., 2020): Unified text-to-text framework
  • NLLB-200 (Koishekenov et al., 2022): Large-scale multilingual translation

4. Internal Mechanisms of Translation Systems

  • This Paper's Contribution: First quantification of layer-wise translation improvement across decoder blocks (Table 7)
  • Related Research: Gross et al. (2025) and Koresh et al. (2025) on Transformer learning mechanisms

Conclusions and Discussion

Main Conclusions

  1. Translation Entropy is Measurable: Through statistical analysis of token replacements maintaining translation invariance, translator entropy can be quantified
  2. Mutual Translation Entropy May Be Asymmetric: English-French translation shows 2.6-fold asymmetry, while English-Hebrew is approximately symmetric, indicating intrinsic structural differences between language pairs
  3. Dual-Token Multiplicative Law: SG(T_α, T_β) ≈ 0.5-0.9 × SG(T_α) × SG(T_β), revealing semantic correlations between tokens
  4. Non-Linear Relationship Between Model Size and Performance: MarianMT (75M parameters) outperforms NLLB-200 (615M parameters) on certain tasks
  5. Progressive Optimization Across Decoder Layers: Translation entropy decreases exponentially across decoder layers (from 10,712 to 116)

Limitations

1. Methodological Level

  • Entropy Ambiguity: Different P_i distributions may produce identical entropy values; requires combining |SG| and N_Av for comprehensive interpretation
  • Sample Size Constraints: Using only 100 pivot tokens and 30 sentences; statistical robustness needs improvement
  • Computational Complexity: Dual-token analysis tested on only ~100 sentences due to combinatorial explosion

2. Theoretical Level

  • Unknown Optimal Entropy: Cannot determine language's minimum achievable entropy; only relative comparison possible
  • Inevitability of Synonymy: Zero entropy unrealistic due to inherent synonymy in natural language
  • Unclear Asymmetry Source: Cannot distinguish whether asymmetry stems from language structure or model training

3. Experimental Level

  • Dataset Dependency: Results based on Opus100; other datasets may produce different results
  • Limited Language Pairs: Only English-French and English-Hebrew tested; broader coverage needed
  • Threshold Selection: While results robust within β_c = 5-10 range, optimal value still needs theoretical guidance

Future Directions

  1. Extension to More Language Pairs: Construct language clusters; distinguish symmetric/asymmetric mutual translation characteristics
  2. Pre-training for High-Entropy Tokens: Develop specialized training strategies for tokens with S(T_α) > 10
  3. Estimation of Theoretical Minimum Entropy: Explore entropy lower bounds for given language pairs
  4. Relationship with Model Architecture: Study effects of encoder/decoder layer count, attention heads, etc. on TE
  5. Online TE Estimation: Develop incremental estimation methods without requiring complete training datasets
  6. Multi-Token Extension: Investigate higher-order correlations for three or more token replacements

In-Depth Evaluation

Strengths

1. Methodological Innovation (★★★★★)

  • Paradigm Shift: First information-theoretic definition of computable translation entropy, circumventing single-language entropy estimation difficulties
  • Theoretical Depth: Combines Shannon entropy theory with modern deep learning, bridging statistical physics and NLP
  • Universality: Applicable to any encoder-decoder architecture, not limited to specific models

2. Experimental Sufficiency (★★★★☆)

  • Multi-Model Validation: Tests three mainstream translators (MarianMT, T5-Base, NLLB-200)
  • Multi-Language Pairs: Four translation directions (EN→FR, FR→EN, EN→HE, HE→EN)
  • Complete Ablation Studies: Threshold robustness, no-threshold comparison, decoder block analysis
  • Limitation: Relatively limited pivot token count (100) and sentence count (30)

3. Result Convincingness (★★★★☆)

  • Important Findings:
    • Mutual translation asymmetry (2.6-fold English-French difference)
    • Dual-token multiplicative effect (coefficient 0.5-0.9)
    • Decoder entropy reduction pattern (exponential decrease)
  • Comparison with Traditional Metrics: TE partially aligns with BLEU/COMET but provides new perspective
  • Limitation: Not validated on larger-scale datasets (e.g., WMT)

4. Writing Clarity (★★★★★)

  • Rigorous Structure: From historical background → problem definition → method design → experimental validation; clear logic
  • Excellent Visualization: Figures 1-6 intuitively present concepts and results
  • Standard Mathematical Expression: Clear formula derivations and well-defined symbols

Weaknesses

1. Missing Statistical Significance Testing

  • No confidence intervals or standard deviations provided for S₉₅
  • Are 100 pivot tokens sufficient? Requires bootstrap validation
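
A bootstrap check of this kind could look like the sketch below. It is a minimal percentile bootstrap under stated assumptions: the function names are hypothetical, and the per-token entropies are synthetic placeholders, not values from the paper.

```python
import random


def s95_like(values, fraction=0.95):
    """Mean of the lowest `fraction` of the values (S95 when there are 100)."""
    ordered = sorted(values)
    keep = max(1, round(fraction * len(ordered)))
    lowest = ordered[:keep]
    return sum(lowest) / len(lowest)


def bootstrap_interval(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the trimmed mean defined above."""
    rng = random.Random(seed)
    stats = sorted(
        s95_like(rng.choices(values, k=len(values))) for _ in range(n_resamples)
    )
    lower = stats[int(n_resamples * alpha / 2)]
    upper = stats[int(n_resamples * (1 - alpha / 2)) - 1]
    return lower, upper


# Synthetic placeholder entropies (NOT data from the paper):
# 95 typical tokens plus 5 high-entropy outliers.
data_rng = random.Random(1)
entropies = [data_rng.uniform(1, 13) for _ in range(95)]
entropies += [data_rng.uniform(100, 400) for _ in range(5)]

point = s95_like(entropies)
low, high = bootstrap_interval(entropies)
print(f"S95 = {point:.2f}, 95% bootstrap interval = [{low:.2f}, {high:.2f}]")
```

With only a few outliers among 100 tokens, resamples that happen to draw more than five of them pull the trimmed mean up sharply, so the interval can be strongly asymmetric. That is itself an argument for reporting it.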

2. Insufficient Analysis of COMET/BLEU Contradictions

  • EN→FR: TE ranking T5-Base > MarianMT, but BLEU/COMET ranking opposite (Table 2 vs Table 8)
  • Only notes differences without exploring underlying causes (e.g., TE measures degeneracy rather than translation quality?)

3. Missing Computational Cost Analysis

  • Single token TE estimation requires 30×30,000 = 900,000 translations
  • 100 tokens require 90 million translations total; enormous computational cost
  • No discussion of complexity reduction strategies
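
The translation counts above follow directly from the experimental setup. The sketch below redoes the arithmetic and converts it to wall-clock time under a purely hypothetical throughput; the figure of 1,000 translations per second is an assumption for illustration, not a measurement.

```python
sentences_per_token = 30   # pivot sentences per token (from the setup)
vocabulary_size = 30_000   # approximate number of candidate replacement tokens
pivot_tokens = 100         # pivot tokens averaged over

per_token = sentences_per_token * vocabulary_size
total = per_token * pivot_tokens

assumed_throughput = 1_000  # translations per second: hypothetical value
hours = total / assumed_throughput / 3600

print(f"{per_token:,} translations per pivot token")
print(f"{total:,} translations in total")
print(f"about {hours:.0f} hours at {assumed_throughput:,} translations/s")
```

Because the cost scales linearly in each of the three factors, restricting the candidate vocabulary is the most direct lever for reducing it.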

4. Insufficient Theoretical Explanation

  • Why is English-French asymmetric while English-Hebrew symmetric? Only speculates "language structure differences"
  • What is the theoretical predicted value for dual-token coefficient 0.5-0.9?
  • What is the optimal distribution form for P_i?

5. Potential Experimental Design Biases

  • Pivot token frequency selection (500-1,500) may introduce mid-frequency word bias
  • Can 30 sentences represent all token usages?
  • Only uses training set sentences; generalization ability untested

Impact

1. Contribution to Field (★★★★☆)

  • Theoretical Contribution: Establishes operational definition of translation entropy, providing new dimension for translation system evaluation
  • Methodological Contribution: Token replacement + statistical analysis paradigm extensible to other NLP tasks (text generation, summarization)
  • Empirical Contribution: Reveals mutual translation asymmetry and decoder optimization mechanisms

2. Practical Value (★★★☆☆)

  • Advantages:
    • No need for manual reference translation annotation
    • Provides greater discrimination than COMET
    • Usable for model selection and hyperparameter tuning
  • Limitations:
    • High computational cost (90 million translations/100 tokens)
    • Requires model internal access (cannot evaluate API translation services)
    • Correlation with human evaluation unverified

3. Reproducibility (★★★★☆)

  • Strengths:
    • Detailed method description (algorithm steps, hyperparameters, datasets)
    • Uses public datasets (Opus100) and models (MarianMT, etc.)
  • Weaknesses:
    • No code link provided
    • Specific selection of 100 pivot tokens not disclosed
    • Selection criteria for 30 sentences unclear

Applicable Scenarios

1. Ideal Scenarios

  • Model Development: Compare translation degeneracy characteristics of different architectures (encoder/decoder layer count, attention mechanisms)
  • Linguistic Research: Study language pair symmetry; construct language clustering based on TE
  • Training Optimization: Identify high-entropy tokens; design targeted training strategies

2. Inapplicable Scenarios

  • Real-Time Evaluation: Computational cost too high for instant evaluation in online translation systems
  • Black-Box APIs: Requires model internal access; cannot evaluate GPT-4 and similar API services
  • Low-Resource Languages: Requires sufficient training data for pivot sentence selection

3. Potential Extensions

  • Text Generation: Evaluate generation diversity of GPT-class models (generation degeneracy)
  • Summarization Systems: Measure information compression rate from source text to summary
  • Dialogue Systems: Quantify semantic equivalence class size of responses

Key References

  1. Shannon, C.E. (1951): Prediction and entropy of printed English - Pioneering work on language entropy
  2. Vaswani et al. (2017): Attention is all you need - Transformer architecture
  3. Papineni et al. (2002): BLEU metric - Classical translation evaluation metric
  4. Rei et al. (2020): COMET - Neural translation evaluation framework
  5. Raffel et al. (2020): T5 - Unified text-to-text Transformer

Summary

The translation entropy framework proposed in this paper represents an important innovation in machine translation evaluation, providing a novel information-theoretic perspective. Its core advantages lie in requiring no reference translations and providing greater discrimination, while core findings (mutual translation asymmetry, dual-token multiplicative effect, decoder entropy reduction) hold significant theoretical and practical importance. However, high computational cost, insufficient theoretical explanation, and unexplored contradictions with traditional metrics are main limitations. Future work reducing computational complexity, extending to more language pairs, and deeply analyzing asymmetry sources could establish this method as a standard tool for translation system evaluation.

Recommendation Index: ★★★★☆ (4/5)
Suitable Readers: Machine translation researchers, scholars working at the intersection of information theory and NLP, translation system developers