It Takes Two: A Dual Stage Approach for Terminology-Aware Translation

Jaswal
This paper introduces DuTerm, a novel two-stage architecture for terminology-constrained machine translation. Our system combines a terminology-aware NMT model, adapted via fine-tuning on large-scale synthetic data, with a prompt-based LLM for post-editing. The LLM stage refines NMT output and enforces terminology adherence. We evaluate DuTerm on English-to-German, English-to-Spanish, and English-to-Russian with the WMT 2025 Terminology Shared Task corpus. We demonstrate that flexible, context-driven terminology handling by the LLM consistently yields higher quality translations than strict constraint enforcement. Our results highlight a critical trade-off, revealing that LLMs work best for high-quality translation as context-driven mutators rather than generators.

Basic Information

  • Paper ID: 2511.07461
  • Title: It Takes Two: A Dual Stage Approach for Terminology-Aware Translation
  • Author: Akshat Singh Jaswal (PES University)
  • Classification: cs.CL, cs.AI
  • Publication Date/Venue: Submitted to arXiv in November 2025, participating in WMT 2025 Terminology Shared Task
  • Paper Link: https://arxiv.org/abs/2511.07461

Abstract

This paper proposes DuTerm, a dual-stage architecture for terminology-constrained machine translation. The system combines a terminology-aware neural machine translation (NMT) model with prompt-based large language model (LLM) post-editing. The NMT model is fine-tuned on large-scale synthetic data, while the LLM stage refines NMT outputs and enforces terminology compliance. The authors evaluate the system on the WMT 2025 Terminology Translation Shared Task for English-to-German, Spanish, and Russian translation. Experiments demonstrate that the LLM's flexible, context-driven terminology handling consistently produces higher-quality translations than strict constraint enforcement, revealing the advantage of LLMs as context-driven "editors" rather than "generators" in high-quality translation.

Research Background and Motivation

1. Core Problem to Address

In specialized domains such as law, medicine, and engineering, accurate and consistent translation of domain-specific terminology is a critical challenge for machine translation. While modern neural machine translation systems have achieved significant fluency on general text, their performance on terminology-constrained text remains suboptimal.

2. Problem Importance

  • Precision Requirements: Professional domain translation demands extremely high terminology accuracy, as errors can have serious consequences
  • Consistency Needs: The same terminology must maintain consistent translation throughout a document
  • Morphological Challenges: In morphologically rich languages like German and Russian, terminology requires correct inflectional variations

3. Limitations of Existing Methods

Existing terminology-constrained translation methods fall into two main categories:

Inference-time Methods:

  • Directly impose constraints during the decoding process (e.g., constrained beam search)
  • Advantages: Effectively enforce constraints
  • Disadvantages: High computational overhead, may compromise fluency and grammatical correctness

Training-time Methods:

  • Integrate terminology information into training data through special tags
  • Advantages: Generate more natural outputs
  • Disadvantages: Cannot guarantee all constraints are satisfied during inference

4. Research Motivation

This paper argues that terminology-constrained translation is not merely a lexical substitution problem but requires deep understanding of linguistic context, particularly when handling complex morphology. DuTerm aims to combine the advantages of both approaches, ensuring terminology accuracy while maintaining translation quality.

Core Contributions

  1. Proposes DuTerm Dual-Stage Architecture: Innovatively combines training-time and inference-time methods through synergistic NMT+LLM collaboration to achieve terminology-aware translation
  2. Large-Scale Synthetic Data Generation Pipeline: Develops a systematic method for generating terminology-annotated synthetic data, including single-terminology and multi-terminology patterns, producing 10k-15k high-quality parallel sentence pairs per language direction
  3. Flexible Terminology Handling Strategy: Proposes three terminology processing modes (noterm, proper, random), allowing dynamic constraint strength selection based on context
  4. Multilingual Evaluation: Conducts comprehensive evaluation on English→German, Spanish, and Russian language pairs, validating cross-linguistic effectiveness
  5. Key Insight: Experiments demonstrate that LLMs are more effective as "context-driven editors" than as "zero-shot generators," revealing the trade-off between strict constraints and translation quality

Methodology Details

Task Definition

  • Input: Source language sentence (English) + terminology dictionary (source-target terminology pairs)
  • Output: Target language translation where specified terminology is correctly translated and marked with labels
  • Constraints: Must use target terminology provided in the dictionary while maintaining translation fluency and grammatical correctness

Model Architecture

DuTerm employs a two-stage pipeline architecture:

Stage 1: Terminology-Aware Neural Machine Translation

1. Terminology Extraction and Analysis

  • Parse WMT 2025 development set to construct bilingual terminology dictionary
  • Extract over 1,000 unique terminology pairs per translation direction
  • Use repetition_ids to track terminology and occurrence frequency
  • Leverage LLM to generate additional terminology similar to dictionary terms

2. Synthetic Data Generation

Generate parallel sentence pairs with terminology labels using two modes:

  • Single-Terminology Mode: Each sentence pair contains only one terminology instance
  • Multi-Terminology Mode: Randomly select 2-3 terminology pairs to co-occur, training co-occurrence handling and disambiguation abilities

Technical Details:

  • Temperature sampling: 0.3-0.7
  • Concurrent generation
  • Strict parsing to ensure format correctness
  • Explicitly insert boundary tags [TERM]...[/TERM] in both source and target languages
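
The two sampling modes and the strict parsing step might look roughly like the sketch below. The two-line `SRC:` / `TGT:` output format, the helper names, and the sample terms are assumptions made for illustration; the summary does not specify the generation format. The parser matches tagged terms exactly, so a real pipeline would need to tolerate inflected target forms.

```python
import random
import re

TAG = re.compile(r"\[TERM\](.+?)\[/TERM\]")

def sample_terms(term_pairs, mode, rng):
    """Pick term pairs for one synthetic sentence (hypothetical helper)."""
    if mode == "single":
        return [rng.choice(term_pairs)]
    k = rng.randint(2, 3)  # multi mode: 2-3 co-occurring terms
    return rng.sample(term_pairs, k)

def parse_pair(raw, expected):
    """Strictly parse a two-line 'SRC: ...' / 'TGT: ...' output; None on any mismatch."""
    lines = [ln.strip() for ln in raw.strip().splitlines() if ln.strip()]
    if len(lines) != 2:
        return None
    if not lines[0].startswith("SRC:") or not lines[1].startswith("TGT:"):
        return None
    src, tgt = lines[0][4:].strip(), lines[1][4:].strip()
    # Every expected term must be tagged on its own side, and nothing else.
    src_terms = sorted(t.lower() for t in TAG.findall(src))
    tgt_terms = sorted(t.lower() for t in TAG.findall(tgt))
    if src_terms != sorted(s.lower() for s, _ in expected):
        return None
    if tgt_terms != sorted(t.lower() for _, t in expected):
        return None
    return src, tgt

# Assumed sample dictionary, for illustration only.
terms = [("firewall", "Firewall"), ("router", "Router"), ("packet", "Paket")]
chosen = sample_terms(terms, "multi", random.Random(0))
raw = "SRC: The [TERM]firewall[/TERM] blocks it.\nTGT: Die [TERM]Firewall[/TERM] blockiert es."
parsed = parse_pair(raw, [("firewall", "Firewall")])
```

Rejecting anything that fails the parse, rather than trying to repair it, keeps malformed generations out of the training set at the cost of some yield.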

3. Label Normalization and Quality Filtering

  • Re-annotation: Enforce consistent annotation standards
  • Longest-First Matching: Prevent partial masking
  • Case Handling: Case-insensitive detection while preserving original casing
  • Reverse Mapping: Ensure symmetric annotation on target side
  • Quality Scoring: Score each sentence pair using COMETQE
  • Deduplication: Deduplicate on source side
  • Threshold Filtering: Conservative thresholds (0.85-0.9), typically retaining 60-70% of output
  • Final Output: Approximately 10k-15k high-quality sentence pairs per language direction
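
A minimal sketch of longest-first, case-insensitive tagging and of the deduplicate-then-threshold filter follows. It assumes quality scores have already been computed elsewhere (for example by a COMETQE model) and are passed in as plain floats; the function names are hypothetical.

```python
import re

def tag_terms(text, terms):
    """Wrap dictionary terms in [TERM]...[/TERM], trying longer terms first."""
    if not terms:
        return text
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in ordered) + r")\b",
        re.IGNORECASE,  # case-insensitive detection
    )
    # \1 reinserts the matched text, so the original casing is preserved.
    return pattern.sub(r"[TERM]\1[/TERM]", text)

def filter_pairs(scored_pairs, threshold=0.85):
    """Keep the first pair per source sentence, then drop pairs below threshold."""
    seen, kept = set(), []
    for src, tgt, score in scored_pairs:
        key = src.strip().lower()
        if key in seen:
            continue
        seen.add(key)
        if score >= threshold:
            kept.append((src, tgt))
    return kept

tagged = tag_terms("A Neural Network beats a plain network.",
                   ["network", "neural network"])
kept = filter_pairs([("a", "b", 0.9), ("A ", "c", 0.95), ("d", "e", 0.5)])
```

Because the alternation lists "neural network" before "network", the longer term wins at the first match position and the shorter term cannot partially mask it.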

4. Multilingual Model Adaptation

  • Base Model: NLLB-200 3.3B (multilingual neural machine translation model)
  • Vocabulary Extension: Add terminology marker tokens ([TERM], [/TERM]), ensuring atomic processing to prevent subword tokenization from breaking markers
  • Training Strategy:
    • Parameter-Efficient Fine-Tuning
    • Multilingual joint training: Merge filtered datasets from three target languages
    • Cross-lingual transfer learning
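
To illustrate why the markers need atomic handling, the toy tokenizer below splits ordinary words into fixed-length pieces but never splits `[TERM]` or `[/TERM]`. It is a stand-in written for this summary, not the NLLB tokenizer. In a Hugging Face setup the same effect is usually obtained by registering the markers through `tokenizer.add_special_tokens` and then calling `model.resize_token_embeddings`, though whether the authors did exactly that is not stated here.

```python
import re

MARKERS = ("[TERM]", "[/TERM]")
_SPLIT = re.compile("(" + "|".join(re.escape(m) for m in MARKERS) + ")")

def toy_tokenize(text, piece_len=3):
    """Split words into fixed-length pieces while keeping markers whole."""
    tokens = []
    for chunk in _SPLIT.split(text):
        if chunk in MARKERS:
            tokens.append(chunk)  # marker stays a single token
            continue
        for word in chunk.split():
            tokens.extend(word[i:i + piece_len]
                          for i in range(0, len(word), piece_len))
    return tokens

tokens = toy_tokenize("Die [TERM]Firewall[/TERM] startet")
```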

Stage 2: LLM-Based Post-Editing

1. Post-Editing Process

  • Input: Source sentence + NMT translation + source-target terminology mapping
  • LLM Selection: GPT-4o (high quality + relatively low cost)
  • Instructions: Preserve semantics, apply precise target terminology, maintain labels, improve readability without rewriting constraints

2. Terminology-Aware Processing

  • Dynamic Parsing: Select proper/random/noterm constraints from reference terminology database based on input
  • Mode Adaptation:
    • When constraints exist: Enforce strictly
    • When no constraints: Perform quality editing only, but remain sensitive to technical terminology
  • Constraint Satisfaction: Include explicit mappings and format rules in prompts
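
How a post-editing prompt could switch between the constrained and unconstrained cases is sketched below. The wording of the template is an assumption made for this example and is not the prompt from the paper's appendices; the call to the LLM itself is left out.

```python
def build_postedit_prompt(source, nmt_output, term_map, target_lang):
    """Assemble a post-editing prompt (hypothetical template, not the paper's)."""
    lines = [
        f"You are post-editing an English-to-{target_lang} machine translation.",
        "Preserve the meaning, keep every [TERM]...[/TERM] label, "
        "and improve readability.",
    ]
    if term_map:
        # Constrained case: spell out the explicit source-target mappings.
        lines.append("Use exactly these target terms:")
        lines.extend(f"- {src} -> {tgt}" for src, tgt in term_map.items())
    else:
        # Unconstrained case: quality editing only.
        lines.append("No term constraints are given; edit for quality "
                     "and keep technical terms precise.")
    lines += [
        f"Source: {source}",
        f"Draft translation: {nmt_output}",
        "Edited translation:",
    ]
    return "\n".join(lines)

constrained = build_postedit_prompt(
    "The firewall blocks the packet.",
    "Die Firewall blockiert das Paket.",
    {"firewall": "Firewall"},
    "German",
)
unconstrained = build_postedit_prompt("Hello.", "Hallo.", {}, "German")
```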

3. Quality Assurance and Robustness

  • Low Temperature Sampling: Temperature 0.3 reduces sampling randomness, keeping edits conservative and largely repeatable (though not strictly deterministic)
  • Verification Mechanisms: Use predefined parser to verify format, label integrity, and constraint satisfaction
  • Structural Checks: Verify filename patterns, presence of all terminology patterns, JSONL structure
  • Quality Assessment:
    • Score using COMETQE after removing labels
    • Check terminology retention rate through exact matching

Technical Innovations

  1. Collaborative Architecture Design: NMT provides structured preliminary translation, LLM focuses on high-level improvements (disambiguation, word order adjustment, contextual refinement), avoiding complexity of zero-shot generation
  2. Synthetic Data Quality Control: Multi-stage filtering (COMETQE scoring + deduplication + high thresholds) ensures training data quality
  3. Flexible Constraint Strategy: Three modes (noterm/proper/random) allow balancing between terminology accuracy and translation naturalness
  4. End-to-End Verification: Comprehensive quality assurance mechanisms from data generation to final output

Experimental Setup

Datasets

  • Source: WMT 2025 Terminology Shared Task
  • Language Pairs: English→German (DE), English→Spanish (ES), English→Russian (RU)
  • Terminology Dictionary: >1,000 terminology pairs per direction
  • Synthetic Training Data: 10k-15k sentence pairs per direction
  • Base Model Training Data: Multilingual data from NLLB-200 pre-training

Evaluation Metrics

  1. BLEU: Overall translation adequacy, measuring n-gram precision
  2. chrF2++: Character-level fluency and robustness, more sensitive to morphological variations
  3. Terminology Success Rate (TSR):
    • Proper SR: Usage rate of correct terminology
    • Random SR: Usage rate of random terminology
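
A simple way to compute a terminology success rate is to count how many required target terms appear verbatim in the output. The sketch below assumes case-insensitive substring matching; the official shared-task scorer may differ (for instance in how it treats inflected forms). Scoring the same outputs once against the proper term lists and once against the random ones would give the two rates.

```python
def term_success_rate(hypotheses, term_lists):
    """Fraction of required target terms found (case-insensitively) in the outputs."""
    hits = total = 0
    for hyp, terms in zip(hypotheses, term_lists):
        hyp_lower = hyp.lower()
        for term in terms:
            total += 1
            hits += term.lower() in hyp_lower
    return hits / total if total else 0.0

# Illustrative data, not taken from the paper.
hyps = ["Die Firewall blockiert das Paket.", "Der Router startet neu."]
required = [["Firewall", "Paket"], ["Switch"]]
rate = term_success_rate(hyps, required)
```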

Comparison Methods

Self-comparison of three terminology handling strategies:

  • noterm: Unconstrained translation (baseline)
  • proper: Strict terminology enforcement
  • random: Random terminology enforcement (testing whether model can enforce inappropriate terminology)

Implementation Details

  • NMT Fine-tuning:
    • Base Model: NLLB-200 3.3B
    • Optimization Strategy: Parameter-Efficient Fine-Tuning
    • Training Data: Multilingual mixture (10k-15k/language)
  • LLM Post-Editing:
    • Model: GPT-4o
    • Temperature: 0.3
    • Prompt Engineering: Detailed prompt templates in Appendices A.1-A.4
  • Quality Control:
    • COMETQE Threshold: 0.85-0.9
    • Retention Rate: 60-70%

Experimental Results

Main Results

Table 1: Evaluation Results for Three Language Pairs and Three Strategies

Language  Type    BLEU   chrF2++  Proper SR  Random SR
DE        noterm  38.24  62.61    0.43       0.69
DE        proper  48.06  70.74    0.98       0.73
DE        random  43.77  67.22    0.48       0.99
ES        noterm  45.98  67.05    0.47       0.73
ES        proper  58.51  76.08    0.99       0.78
ES        random  53.28  72.05    0.49       0.98
RU        noterm  27.88  55.29    0.39       0.69
RU        proper  35.80  63.57    0.98       0.72
RU        random  32.25  59.85    0.42       0.99

Key Findings

  1. Strict Terminology Enforcement Shows Significant Effects:
    • Proper mode achieves highest BLEU and chrF2++ across all languages
    • German: 48.06 BLEU (vs 38.24 noterm, +25.7%)
    • Spanish: 58.51 BLEU (vs 45.98 noterm, +27.3%)
    • Russian: 35.80 BLEU (vs 27.88 noterm, +28.4%)
    • Proper terminology success rate is 0.98-0.99 in proper mode across all three languages, approaching perfection
  2. Unconstrained Translation Performs Worst:
    • Noterm achieves lowest BLEU and chrF2++ across all languages
    • Fluency is acceptable, but terminology accuracy is poor (proper SR: 0.39-0.47)
  3. Random Terminology Enforcement Trade-offs:
    • Random mode produces moderate BLEU/chrF2++
    • Random terminology success rate is 0.98-0.99 in random mode, showing that the model can enforce arbitrary terminology
    • However, this compromises contextual appropriateness
  4. Language-Specific Trends:
    • Spanish: Highest overall scores (structure similar to English)
    • Russian: Lowest absolute scores but the largest relative BLEU gain from noterm to proper (+28.4%), consistent with the difficulty of terminology control in morphologically rich languages
    • German: Moderate performance, but significant improvement in proper mode
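
The relative improvements can be rechecked directly from the BLEU values in Table 1. The short calculation below is only a verification aid; the helper name is arbitrary.

```python
# BLEU scores copied from Table 1 (noterm vs proper).
bleu = {
    "DE": {"noterm": 38.24, "proper": 48.06},
    "ES": {"noterm": 45.98, "proper": 58.51},
    "RU": {"noterm": 27.88, "proper": 35.80},
}

def gains(scores):
    """Return (absolute gain, relative gain in %) of proper over noterm."""
    absolute = scores["proper"] - scores["noterm"]
    relative = 100 * absolute / scores["noterm"]
    return round(absolute, 2), round(relative, 1)

results = {lang: gains(s) for lang, s in bleu.items()}
for lang, (absolute, relative) in results.items():
    print(f"{lang}: +{absolute} BLEU ({relative}% relative)")
```

Under this calculation Russian has the largest relative gain, while Spanish has the largest absolute gain.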

Experimental Findings

  1. Quality-Constraint Trade-off: Strict enforcement maximizes terminology accuracy and improves surface quality metrics, but may occasionally reduce flexibility
  2. LLM as Editor Advantage: Starting from NMT's structured preliminary translation, LLMs can focus on high-level improvements more effectively than zero-shot generation
  3. Cross-Linguistic Consistency: Consistent trends across three languages validate method universality
  4. Morphological Challenges: Russian's low baseline scores and large improvement space highlight terminology handling difficulty in morphologically rich languages

Related Work

1. Terminology-Constrained Machine Translation

  • Inference-time Methods:
    • Constrained Beam Search
    • N-best list reranking
    • Recent work (Zhang et al., 2023) exploring efficiency improvements
  • Training-time Methods:
    • Special tag annotation (Dinu et al., 2019)
    • Vocabulary-constrained Levenshtein Transformer (Susanto et al., 2020)

2. LLMs for Machine Translation

  • Domain terminology integration (Moslem et al., 2023)
  • GPT-4 automatic translation post-editing (Raunak et al., 2023)

3. Multilingual NMT

  • Transformer architecture (Vaswani et al., 2023)
  • NLLB-200 (NLLB Team et al., 2022): No Language Left Behind human-centered translation
  • Google Multilingual NMT (Johnson et al., 2017): Zero-shot translation

4. Advantages of This Work

  • Method Fusion: First systematic combination of training-time tags and inference-time LLM post-editing
  • Large-Scale Synthetic Data: Quality-controlled automatic generation pipeline
  • Flexible Strategy: Dynamic terminology handling rather than binary choices

Conclusions and Discussion

Main Conclusions

  1. Dual-Stage Architecture is Effective: DuTerm successfully combines NMT and LLM advantages, balancing terminology accuracy and translation quality
  2. Flexible Handling Outperforms Strict Constraints: While proper mode performs best on automatic metrics, the LLM's context-driven handling capability is the key success factor
  3. LLM Positioning: LLM as "editor" (improving upon NMT output) is more effective than as "generator" (translating from scratch)
  4. Cross-Linguistic Validation: Method is effective across three typologically diverse languages (German, Spanish, Russian)

Limitations

The authors explicitly identify the following constraints:

  1. Prompt Dependency:
    • Highly dependent on carefully designed prompts
    • May not generalize well across domains, languages, or model architectures
  2. Sequential Processing Limitations:
    • Sequential terminology matching and translation refinement limits adaptive constraint enforcement capability
  3. Sentence-Level Processing:
    • Ignores document-level consistency and context-aware terminology usage opportunities
    • These are critical in real translation tasks
  4. Single-Model Evaluation:
    • Evaluated only on GPT-4o, limiting generalization of findings
  5. Domain Limitations:
    • Focuses on technical and business domains
    • May not capture challenges in specialized fields like medicine or law
  6. Evaluation Metric Limitations:
    • COMETQE, BLEU, and chrF2++ are automatic and therefore scale well
    • However, they may not fully reflect terminology accuracy and contextual appropriateness
    • Human evaluation is needed as a supplement

Future Directions

  1. Adaptive Learning Mechanisms:
    • Dynamically integrate terminology rather than relying on static prompts
    • Enhance cross-domain and cross-lingual robustness
  2. End-to-End Architecture:
    • Memory-augmented architecture maintaining cross-sentence and document consistency
    • More coherent output
  3. Extended Evaluation:
    • Other language models
    • Diverse domain-specific corpora
    • Validate generalization and reveal domain-dependent challenges
  4. Hybrid Strategies:
    • Combine prompt guidance with fine-tuning or reinforcement learning
    • User-driven terminology control interaction
    • Improve usability and accuracy
  5. Document-Level Processing:
    • Move beyond sentence level to achieve document-level consistency

In-Depth Evaluation

Strengths

  1. Method Innovation:
    • Dual-stage architecture cleverly combines NMT and LLM advantages
    • Not simple stacking, but clear division of labor: NMT provides structure, LLM refines context
    • Flexible three-mode strategy (noterm/proper/random) allows fine-grained control
  2. Engineering Completeness:
    • Detailed synthetic data generation pipeline with multiple quality controls
    • Systematic label normalization process
    • End-to-end verification mechanisms
    • Provides complete prompt templates (appendices) for strong reproducibility
  3. Experimental Sufficiency:
    • Three language pairs with significant typological differences
    • Systematic comparison of three terminology handling strategies
    • Multi-dimensional evaluation (BLEU, chrF2++, terminology success rate)
    • Consistent results with clear trends
  4. Insight Value:
    • "LLM as editor vs. generator" finding has universal significance
    • Reveals trade-off between terminology constraints and translation quality
    • Provides clear direction for future research
  5. Writing Clarity:
    • Clear structure and logical flow
    • Sufficient technical details
    • Candid discussion of limitations

Weaknesses

  1. Insufficient Baseline Comparisons:
    • Primarily self-comparison (three modes)
    • Lacks direct comparison with other SOTA terminology-constrained translation methods
    • No comparison with pure NMT or pure LLM approaches
  2. Missing Human Evaluation:
    • Completely relies on automatic metrics
    • Contextual appropriateness of terminology, translation naturalness require human judgment
    • Does high proper mode score truly mean better translation quality?
  3. Insufficient Ablation Studies:
    • NMT stage contribution not separately evaluated
    • Specific improvement types from LLM post-editing not analyzed
    • Impact of synthetic data quantity on performance not explored
  4. Missing Cost Analysis:
    • GPT-4o usage costs not discussed
    • Inference time not reported
    • Real deployment feasibility unclear
  5. Insufficient Case Analysis:
    • No specific translation examples provided
    • Difficult to intuitively understand model behavior
    • Error type analysis missing
  6. Insufficient Generalization Verification:
    • Only one LLM (GPT-4o)
    • Only technical and business domains
    • Other open-source LLMs (Llama, Mistral) not tested

Impact

  1. Contribution to Field:
    • Provides new paradigm for terminology-constrained translation
    • Dual-stage architecture may inspire subsequent research
    • "Editor vs. generator" insight has theoretical value
  2. Practical Value:
    • Moderate: Method depends on GPT-4o, cost may limit large-scale application
    • But approach transferable to open-source models
    • Synthetic data generation pipeline has practical value
  3. Reproducibility:
    • Good: Provides detailed prompt templates
    • Clear method description
    • But GPT-4o dependency may affect complete reproduction
  4. Value for Subsequent Research:
    • Provides baseline for WMT 2025 task
    • Flexible constraint strategy worth deeper exploration
    • Document-level extension is natural next step

Applicable Scenarios

  1. Most Suitable:
    • Technical document translation (IT, finance)
    • Scenarios with clear terminology dictionaries
    • Applications requiring high terminology consistency but tolerating some cost
  2. Potentially Suitable:
    • Business contract translation
    • Product manual localization
    • Enterprise internal document translation
  3. Less Suitable:
    • Real-time translation (cost and latency)
    • Resource-constrained environments (depends on large LLMs)
    • Literary translation (over-constraint may damage creativity)
    • Highly specialized domains (medicine, law require more domain validation)
  4. Potentially Suitable After Improvements:
    • After replacing GPT-4o with open-source LLMs: Low-cost scenarios
    • After extending to document level: Long document translation
    • After adding human interaction: CAT tool integration

References

Key Citations

  1. Dinu et al., 2019: Training neural machine translation to apply terminology constraints - Representative work on training-time tag methods
  2. Raunak et al., 2023: Leveraging GPT-4 for automatic translation post-editing - Direct inspiration for LLM post-editing
  3. NLLB Team et al., 2022: NLLB-200 - Base multilingual NMT model used in this work
  4. Moslem et al., 2023: Domain terminology integration into machine translation - Related work on domain terminology integration
  5. Zhang et al., 2023: Understanding and improving the robustness of terminology constraints - Recent advances in inference-time constraint methods
  6. Rei et al., 2022: CometKiwi/COMETQE - Quality assessment metrics used in this work
  7. Vaswani et al., 2023: Attention is all you need - Transformer architecture foundation

Overall Evaluation

DuTerm is a well-engineered, clearly reasoned applied research paper. Its core contribution lies in proposing a practical dual-stage architecture that cleverly combines NMT and LLM advantages for terminology-constrained translation. The insight that "LLMs are more effective as editors rather than generators" has universal value and may influence future hybrid translation system design.

However, the paper has shortcomings in experimental depth (lacking comparisons with other methods, human evaluation) and generalization verification (single LLM, limited domains). Additionally, dependence on GPT-4o may limit applicability in resource-constrained scenarios.

Overall, this is a solid shared task participation paper that provides valuable methods and insights, but requires more follow-up work to validate effectiveness and practicality in broader scenarios. For researchers working on machine translation, particularly terminology-constrained translation, the dual-stage approach and synthetic data generation pipeline provided in this paper have reference value.