It Takes Two: A Dual Stage Approach for Terminology-Aware Translation

Jaswal

This paper introduces DuTerm, a novel two-stage architecture for terminology-constrained machine translation. Our system combines a terminology-aware NMT model, adapted via fine-tuning on large-scale synthetic data, with a prompt-based LLM for post-editing. The LLM stage refines NMT output and enforces terminology adherence. We evaluate DuTerm on English-to-German, English-to-Spanish, and English-to-Russian with the WMT 2025 Terminology Shared Task corpus. We demonstrate that flexible, context-driven terminology handling by the LLM consistently yields higher quality translations than strict constraint enforcement. Our results highlight a critical trade-off, revealing that LLMs work best for high-quality translation as context-driven mutators rather than generators.

Basic Information

Paper ID: 2511.07461
Title: It Takes Two: A Dual Stage Approach for Terminology-Aware Translation
Author: Akshat Singh Jaswal (PES University)
Classification: cs.CL, cs.AI
Publication Date/Venue: Submitted to arXiv in November 2025, participating in WMT 2025 Terminology Shared Task
Paper Link: https://arxiv.org/abs/2511.07461

Abstract

This paper proposes DuTerm, a dual-stage architecture for terminology-constrained machine translation. The system combines a terminology-aware neural machine translation (NMT) model with prompt-based large language model (LLM) post-editing. The NMT model is fine-tuned on large-scale synthetic data, while the LLM stage refines NMT outputs and enforces terminology compliance. The authors evaluate the system on the WMT 2025 Terminology Translation Shared Task for English-to-German, Spanish, and Russian translation. Experiments demonstrate that the LLM's flexible, context-driven terminology handling consistently produces higher-quality translations than strict constraint enforcement, revealing the advantage of LLMs as context-driven "editors" rather than "generators" in high-quality translation.

Research Background and Motivation

1. Core Problem to Address

In specialized domains such as law, medicine, and engineering, accurate and consistent translation of domain-specific terminology is a critical challenge for machine translation. While modern neural machine translation systems have achieved significant fluency on general text, their performance on terminology-constrained text remains suboptimal.

2. Problem Importance

Precision Requirements: Professional domain translation demands extremely high terminology accuracy, as errors can have serious consequences
Consistency Needs: The same terminology must maintain consistent translation throughout a document
Morphological Challenges: In morphologically rich languages like German and Russian, a term must appear in the correctly inflected form for its position in the sentence, so copying the dictionary form verbatim is often wrong

3. Limitations of Existing Methods

Existing terminology-constrained translation methods fall into two main categories:

Inference-time Methods:

Directly impose constraints during the decoding process (e.g., constrained beam search)
Advantages: Effectively enforce constraints
Disadvantages: High computational overhead, may compromise fluency and grammatical correctness

Training-time Methods:

Integrate terminology information into training data through special tags
Advantages: Generate more natural outputs
Disadvantages: Cannot guarantee all constraints are satisfied during inference

4. Research Motivation

This paper argues that terminology-constrained translation is not merely a lexical substitution problem but requires deep understanding of linguistic context, particularly when handling complex morphology. DuTerm aims to combine the advantages of both approaches, ensuring terminology accuracy while maintaining translation quality.

Core Contributions

Proposes DuTerm Dual-Stage Architecture: Innovatively combines training-time and inference-time methods through synergistic NMT+LLM collaboration to achieve terminology-aware translation
Large-Scale Synthetic Data Generation Pipeline: Develops a systematic method for generating terminology-annotated synthetic data, including single-terminology and multi-terminology patterns, producing 10k-15k high-quality parallel sentence pairs per language direction
Flexible Terminology Handling Strategy: Proposes three terminology processing modes (noterm, proper, random), allowing dynamic constraint strength selection based on context
Multilingual Evaluation: Conducts comprehensive evaluation on English→German, Spanish, and Russian language pairs, validating cross-linguistic effectiveness
Key Insight: Experiments demonstrate that LLMs are more effective as "context-driven editors" than as "zero-shot generators," revealing the trade-off between strict constraints and translation quality

Methodology Details

Task Definition

Input: Source language sentence (English) + terminology dictionary (source-target terminology pairs)
Output: Target language translation where specified terminology is correctly translated and marked with labels
Constraints: Must use target terminology provided in the dictionary while maintaining translation fluency and grammatical correctness

Model Architecture

DuTerm employs a two-stage pipeline architecture:

Stage 1: Terminology-Aware Neural Machine Translation

1. Terminology Extraction and Analysis

Parse WMT 2025 development set to construct bilingual terminology dictionary
Extract over 1,000 unique terminology pairs per translation direction
Use repetition_ids to track terminology and occurrence frequency
Leverage LLM to generate additional terminology similar to dictionary terms

2. Synthetic Data Generation

Generate parallel sentence pairs with terminology labels using two modes:

Single-Terminology Mode: Each sentence pair contains only one terminology instance
Multi-Terminology Mode: Randomly select 2-3 terminology pairs to co-occur, training co-occurrence handling and disambiguation abilities

Technical Details:

Temperature sampling: 0.3-0.7
Concurrent generation
Strict parsing to ensure format correctness
Explicitly insert boundary tags [TERM]...[/TERM] in both source and target languages
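
The two generation modes and the strict parsing step can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the "SRC:/TGT:" output layout, and the exact-match check on tagged terms are all assumptions made for the example. In particular, a real pipeline for German or Russian would need to accept inflected target forms, which this sketch does not.

```python
import random
import re

TAG_RE = re.compile(r"\[TERM\](.+?)\[/TERM\]")

def sample_terms(term_pairs, mode, rng):
    """Pick one pair (single mode) or 2-3 pairs (multi mode)."""
    if mode == "single":
        k = 1
    else:
        if len(term_pairs) < 2:
            raise ValueError("multi mode needs at least two term pairs")
        k = rng.randint(2, min(3, len(term_pairs)))
    return rng.sample(term_pairs, k)

def parse_pair(raw, expected):
    """Strictly parse a hypothetical 'SRC: ...' / 'TGT: ...' LLM reply.

    Returns (source, target), or None if the layout is wrong or the tagged
    terms do not match the requested pairs (case-insensitive, exact form).
    """
    match = re.fullmatch(r"SRC: (.+)\nTGT: (.+)", raw.strip())
    if match is None:
        return None
    src, tgt = match.group(1), match.group(2)
    got_src = sorted(t.lower() for t in TAG_RE.findall(src))
    got_tgt = sorted(t.lower() for t in TAG_RE.findall(tgt))
    want_src = sorted(s.lower() for s, _ in expected)
    want_tgt = sorted(t.lower() for _, t in expected)
    if got_src != want_src or got_tgt != want_tgt:
        return None
    return src, tgt

rng = random.Random(0)
pairs = [("firewall", "Firewall"), ("router", "Router"), ("server", "Server")]
print(sample_terms(pairs, "multi", rng))
```

Rejecting any reply that deviates from the expected layout trades some yield for cleaner training data, which is consistent with the conservative filtering described for this pipeline.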

3. Label Normalization and Quality Filtering

Re-annotation: Enforce consistent annotation standards
Longest-First Matching: Prevent partial masking
Case Handling: Case-insensitive detection while preserving original casing
Reverse Mapping: Ensure symmetric annotation on target side
Quality Scoring: Score each sentence pair using COMETQE
Deduplication: Deduplicate on source side
Threshold Filtering: Conservative thresholds (0.85-0.9), typically retaining 60-70% of output
Final Output: Approximately 10k-15k high-quality sentence pairs per language direction
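
The re-annotation rules above (longest-first matching, case-insensitive detection, original casing preserved) can be illustrated with a short sketch. The function name `tag_terms` and the word-boundary rule are assumptions for the example; the paper's normalization code is not shown in this summary.

```python
import re

def tag_terms(sentence, terms):
    """Wrap each dictionary term in [TERM]...[/TERM], longest term first."""
    # Drop any existing tags so re-annotation starts from clean text.
    clean = sentence.replace("[TERM]", "").replace("[/TERM]", "")
    # Longest-first alternation: "access point" is tried before "access",
    # so the shorter term cannot partially mask the longer one.
    ordered = sorted(set(terms), key=len, reverse=True)
    if not ordered:
        return clean
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(t) for t in ordered) + r")(?!\w)",
        re.IGNORECASE,  # detect regardless of case
    )
    # The matched text itself is re-emitted, so the original casing is kept.
    return pattern.sub(lambda m: f"[TERM]{m.group(1)}[/TERM]", clean)

print(tag_terms("The Access Point uses access control.", ["access", "access point"]))
```

Because Python's regex alternation tries alternatives left to right, sorting the terms by length before joining them is enough to get longest-match behavior at each position.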

4. Multilingual Model Adaptation

Base Model: NLLB-200 3.3B (multilingual neural machine translation model)
Vocabulary Extension: Add terminology marker tokens ([TERM], [/TERM]), ensuring atomic processing to prevent subword tokenization from breaking markers
Training Strategy:
- Parameter-Efficient Fine-Tuning
- Multilingual joint training: Merge filtered datasets from three target languages
- Cross-lingual transfer learning

Stage 2: LLM-Based Post-Editing

1. Post-Editing Process

Input: Source sentence + NMT translation + source-target terminology mapping
LLM Selection: GPT-4o (high quality + relatively low cost)
Instructions: Preserve semantics, apply precise target terminology, maintain labels, improve readability without rewriting constraints

2. Terminology-Aware Processing

Dynamic Parsing: Select proper/random/noterm constraints from reference terminology database based on input
Mode Adaptation:
- When constraints exist: Enforce strictly
- When no constraints: Perform quality editing only, but remain sensitive to technical terminology
Constraint Satisfaction: Include explicit mappings and format rules in prompts
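
A minimal sketch of how such a post-editing prompt might be assembled, covering both the constrained and the unconstrained case. The wording of the template is hypothetical (the paper's actual prompts are in its appendices and are not reproduced here), and no LLM call is made.

```python
def build_post_edit_prompt(source, nmt_output, term_map):
    """Assemble a post-editing prompt; the template text is illustrative only."""
    lines = [
        "You are post-editing a machine translation.",
        "Preserve the meaning, keep every [TERM]...[/TERM] tag, and improve readability.",
    ]
    if term_map:
        # Constraints exist: list the explicit source -> target mappings.
        lines.append("Use exactly these target terms:")
        lines += [f"- {src} -> {tgt}" for src, tgt in term_map.items()]
    else:
        # No constraints: quality editing only.
        lines.append(
            "No terminology constraints are given; edit for quality only, "
            "keeping technical terms accurate."
        )
    lines += [
        "",
        f"Source: {source}",
        f"Draft translation: {nmt_output}",
        "Post-edited translation:",
    ]
    return "\n".join(lines)

print(build_post_edit_prompt(
    "Restart the [TERM]router[/TERM].",
    "Starten Sie den [TERM]Router[/TERM] neu.",
    {"router": "Router"},
))
```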

3. Quality Assurance and Robustness

Low Temperature Sampling: Temperature 0.3 keeps edits conservative and reduces run-to-run variation (sampling at this temperature is still not fully deterministic)
Verification Mechanisms: Use predefined parser to verify format, label integrity, and constraint satisfaction
Structural Checks: Verify filename patterns, presence of all terminology patterns, JSONL structure
Quality Assessment:
- Score using COMETQE after removing labels
- Check terminology retention rate through exact matching
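
The label-integrity and exact-match retention checks could look roughly like this. The helper names are assumptions, and the sketch covers only the tag and term checks, not the filename or JSONL structure checks.

```python
import re

TOKEN_RE = re.compile(r"\[/?TERM\]")

def tags_well_formed(text):
    """True if [TERM] and [/TERM] strictly alternate, with no nesting or dangling tag."""
    is_open = False
    for token in TOKEN_RE.findall(text):
        if token == "[TERM]":
            if is_open:
                return False  # nested opening tag
            is_open = True
        else:
            if not is_open:
                return False  # closing tag without an opening tag
            is_open = False
    return not is_open

def strip_tags(text):
    """Remove labels, e.g. before scoring the output with a QE metric."""
    return TOKEN_RE.sub("", text)

def missing_terms(text, required):
    """Required target terms that do not appear verbatim (case-insensitive)."""
    plain = strip_tags(text).lower()
    return [term for term in required if term.lower() not in plain]

output = "Der [TERM]Router[/TERM] läuft."
print(tags_well_formed(output), missing_terms(output, ["Router", "Firewall"]))
```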

Technical Innovations

Collaborative Architecture Design: NMT provides structured preliminary translation, LLM focuses on high-level improvements (disambiguation, word order adjustment, contextual refinement), avoiding complexity of zero-shot generation
Synthetic Data Quality Control: Multi-stage filtering (COMETQE scoring + deduplication + high thresholds) ensures training data quality
Flexible Constraint Strategy: Three modes (noterm/proper/random) allow balancing between terminology accuracy and translation naturalness
End-to-End Verification: Comprehensive quality assurance mechanisms from data generation to final output
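
As an illustration of the multi-stage filtering, the sketch below deduplicates on the source side and applies a score threshold. It assumes quality scores have already been computed by a QE model and are passed in as plain floats; the function name, the whitespace/case normalization used for deduplication, and the order of the two checks are assumptions.

```python
def filter_pairs(scored_pairs, threshold=0.85):
    """Keep the first pair per source sentence whose QE score meets the threshold.

    scored_pairs: iterable of (source, target, qe_score) tuples.
    """
    seen_sources = set()
    kept = []
    for source, target, score in scored_pairs:
        key = " ".join(source.lower().split())  # normalize case and whitespace
        if key in seen_sources or score < threshold:
            continue
        seen_sources.add(key)
        kept.append((source, target))
    return kept

candidates = [
    ("Hello world", "Hallo Welt", 0.91),
    ("hello  world", "Hallo, Welt", 0.95),  # duplicate source after normalization
    ("Bye", "Tschüss", 0.50),               # below threshold
]
kept = filter_pairs(candidates)
print(kept, f"retention={len(kept) / len(candidates):.0%}")
```

A low-scoring pair is not recorded as seen, so a later, better-scoring pair with the same source can still be kept.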

Experimental Setup

Datasets

Source: WMT 2025 Terminology Shared Task
Language Pairs: English→German (DE), English→Spanish (ES), English→Russian (RU)
Terminology Dictionary: >1,000 terminology pairs per direction
Synthetic Training Data: 10k-15k sentence pairs per direction
Base Model Training Data: Multilingual data from NLLB-200 pre-training

Evaluation Metrics

BLEU: Overall translation adequacy, measuring n-gram precision
chrF2++: Character-level fluency and robustness, more sensitive to morphological variations
Terminology Success Rate (TSR):
- Proper SR: Usage rate of correct terminology
- Random SR: Usage rate of random terminology
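
A terminology success rate of this kind can be computed as the share of required target terms that appear in the system output. The sketch below uses case-insensitive substring matching, which is an assumption; the shared task's official scorer may match terms differently (for example, by lemma). The same function gives Proper SR or Random SR depending on which term lists are passed in.

```python
def term_success_rate(hypotheses, term_lists):
    """Fraction of required target terms found in their corresponding hypothesis."""
    total = 0
    hits = 0
    for hypothesis, terms in zip(hypotheses, term_lists):
        lowered = hypothesis.lower()
        for term in terms:
            total += 1
            if term.lower() in lowered:
                hits += 1
    return hits / total if total else 0.0

hyps = ["Der Router startet neu.", "Die Firewall blockiert den Port."]
proper_terms = [["Router"], ["Firewall", "Port"]]
random_terms = [["Verteiler"], ["Brandmauer", "Anschluss"]]
print(term_success_rate(hyps, proper_terms), term_success_rate(hyps, random_terms))
```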

Comparison Methods

Self-comparison of three terminology handling strategies:

noterm: Unconstrained translation (baseline)
proper: Strict terminology enforcement
random: Random terminology enforcement (testing whether model can enforce inappropriate terminology)

Implementation Details

NMT Fine-tuning:
- Base Model: NLLB-200 3.3B
- Optimization Strategy: Parameter-Efficient Fine-Tuning
- Training Data: Multilingual mixture (10k-15k/language)
LLM Post-Editing:
- Model: GPT-4o
- Temperature: 0.3
- Prompt Engineering: Detailed prompt templates in Appendices A.1-A.4
Quality Control:
- COMETQE Threshold: 0.85-0.9
- Retention Rate: 60-70%

Experimental Results

Main Results

Table 1: Evaluation Results for Three Language Pairs and Three Strategies

Language	Mode	BLEU	chrF2++	Proper SR	Random SR
DE	noterm	38.24	62.61	0.43	0.69
DE	proper	48.06	70.74	0.98	0.73
DE	random	43.77	67.22	0.48	0.99
ES	noterm	45.98	67.05	0.47	0.73
ES	proper	58.51	76.08	0.99	0.78
ES	random	53.28	72.05	0.49	0.98
RU	noterm	27.88	55.29	0.39	0.69
RU	proper	35.80	63.57	0.98	0.72
RU	random	32.25	59.85	0.42	0.99

Key Findings

Strict Terminology Enforcement Shows Significant Effects:
- Proper mode achieves highest BLEU and chrF2++ across all languages
- German: 48.06 BLEU (vs 38.24 noterm, +25.7%)
- Spanish: 58.51 BLEU (vs 45.98 noterm, +27.3%)
- Russian: 35.80 BLEU (vs 27.88 noterm, +28.4%)
- Proper terminology success rate ≥0.98 in all three languages, approaching perfection
Unconstrained Translation Performs Worst:
- Noterm achieves lowest BLEU and chrF2++ across all languages
- Fluency is acceptable, but terminology accuracy is poor (proper SR: 0.39-0.47)
Random Terminology Enforcement Trade-offs:
- Random mode produces moderate BLEU/chrF2++
- Random terminology success rate ≥0.98 in all three languages, showing that the model can enforce arbitrary terminology
- However, this compromises contextual appropriateness
Language-Specific Trends:
- Spanish: Highest overall scores (structure similar to English)
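- Russian: Largest relative BLEU gain from noterm to proper (+28.4%), although its absolute gain (7.92 points) is the smallest of the three; this reflects the low baseline and the difficulty of terminology control in a morphologically rich language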
- German: Moderate performance, but significant improvement in proper mode
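
The relative gains quoted above follow directly from the BLEU columns of Table 1. A short check, using only the table's numbers:

```python
# BLEU scores copied from Table 1 (noterm vs. proper).
BLEU = {
    "DE": {"noterm": 38.24, "proper": 48.06},
    "ES": {"noterm": 45.98, "proper": 58.51},
    "RU": {"noterm": 27.88, "proper": 35.80},
}

def relative_gain(base, new):
    """Percentage improvement of `new` over `base`."""
    return (new - base) / base * 100

for lang, scores in BLEU.items():
    absolute = scores["proper"] - scores["noterm"]
    relative = relative_gain(scores["noterm"], scores["proper"])
    print(f"{lang}: +{absolute:.2f} BLEU ({relative:+.1f}%)")
```

This gives absolute gains of 9.82 (DE), 12.53 (ES), and 7.92 (RU) BLEU points, and relative gains of 25.7%, 27.3%, and 28.4%. Spanish has the largest absolute gain, while Russian has the largest relative gain because its baseline is lowest.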

Experimental Findings

Quality-Constraint Trade-off: Strict enforcement maximizes terminology accuracy and improves surface quality metrics, but may occasionally reduce flexibility
LLM as Editor Advantage: Starting from NMT's structured preliminary translation, LLMs can focus on high-level improvements more effectively than zero-shot generation
Cross-Linguistic Consistency: Consistent trends across three languages validate method universality
Morphological Challenges: Russian's low baseline scores and large improvement space highlight terminology handling difficulty in morphologically rich languages

Related Work

1. Terminology-Constrained Machine Translation

Inference-time Methods:
- Constrained Beam Search
- N-best list reranking
- Recent work (Zhang et al., 2023) exploring efficiency improvements
Training-time Methods:
- Special tag annotation (Dinu et al., 2019)
- Vocabulary-constrained Levenshtein Transformer (Susanto et al., 2020)

2. LLMs for Machine Translation

Domain terminology integration (Moslem et al., 2023)
GPT-4 automatic translation post-editing (Raunak et al., 2023)

3. Multilingual NMT

Transformer architecture (Vaswani et al., 2023)
NLLB-200 (Team et al., 2022): No Language Left Behind human-centered translation
Google Multilingual NMT (Johnson et al., 2017): Zero-shot translation

4. Advantages of This Work

Method Fusion: First systematic combination of training-time tags and inference-time LLM post-editing
Large-Scale Synthetic Data: Quality-controlled automatic generation pipeline
Flexible Strategy: Dynamic terminology handling rather than binary choices

Conclusions and Discussion

Main Conclusions

Dual-Stage Architecture is Effective: DuTerm successfully combines NMT and LLM advantages, balancing terminology accuracy and translation quality
Flexible Handling Outperforms Strict Constraints: Although proper mode scores best on automatic metrics, the authors attribute the system's success primarily to the LLM's context-driven handling of terminology
LLM Positioning: LLM as "editor" (improving upon NMT output) is more effective than as "generator" (translating from scratch)
Cross-Linguistic Validation: Method is effective across three typologically diverse languages (German, Spanish, Russian)

Limitations

The authors explicitly identify the following constraints:

Prompt Dependency:
- Highly dependent on carefully designed prompts
- May not generalize well across domains, languages, or model architectures
Sequential Processing Limitations:
- Sequential terminology matching and translation refinement limits adaptive constraint enforcement capability
Sentence-Level Processing:
- Ignores document-level consistency and context-aware terminology usage opportunities
- These are critical in real translation tasks
Single-Model Evaluation:
- Evaluated only on GPT-4o, limiting generalization of findings
Domain Limitations:
- Focuses on technical and business domains
- May not capture challenges in specialized fields like medicine or law
Evaluation Metric Limitations:
- COMETQE, BLEU, and chrF2++ are automatic and therefore scale well
- But they may not fully reflect terminology accuracy and contextual appropriateness
- Human evaluation needed as supplement

Future Directions

Adaptive Learning Mechanisms:
- Dynamically integrate terminology rather than relying on static prompts
- Enhance cross-domain and cross-lingual robustness
End-to-End Architecture:
- Memory-augmented architecture maintaining cross-sentence and document consistency
- More coherent output
Extended Evaluation:
- Other language models
- Diverse domain-specific corpora
- Validate generalization and reveal domain-dependent challenges
Hybrid Strategies:
- Combine prompt guidance with fine-tuning or reinforcement learning
- User-driven terminology control interaction
- Improve usability and accuracy
Document-Level Processing:
- Move beyond sentence level to achieve document-level consistency

In-Depth Evaluation

Strengths

Method Innovation:
- Dual-stage architecture cleverly combines NMT and LLM advantages
- Not simple stacking, but clear division of labor: NMT provides structure, LLM refines context
- Flexible three-mode strategy (noterm/proper/random) allows fine-grained control
Engineering Completeness:
- Detailed synthetic data generation pipeline with multiple quality controls
- Systematic label normalization process
- End-to-end verification mechanisms
- Provides complete prompt templates (appendices) for strong reproducibility
Experimental Sufficiency:
- Three language pairs with significant typological differences
- Systematic comparison of three terminology handling strategies
- Multi-dimensional evaluation (BLEU, chrF2++, terminology success rate)
- Consistent results with clear trends
Insight Value:
- "LLM as editor vs. generator" finding has universal significance
- Reveals trade-off between terminology constraints and translation quality
- Provides clear direction for future research
Writing Clarity:
- Clear structure and logical flow
- Sufficient technical details
- Candid discussion of limitations

Weaknesses

Insufficient Baseline Comparisons:
- Primarily self-comparison (three modes)
- Lacks direct comparison with other SOTA terminology-constrained translation methods
- No comparison with pure NMT or pure LLM approaches
Missing Human Evaluation:
- Completely relies on automatic metrics
- Contextual appropriateness of terminology, translation naturalness require human judgment
- Open question: does a high proper-mode score truly mean better translation quality?
Insufficient Ablation Studies:
- NMT stage contribution not separately evaluated
- Specific improvement types from LLM post-editing not analyzed
- Impact of synthetic data quantity on performance not explored
Missing Cost Analysis:
- GPT-4o usage costs not discussed
- Inference time not reported
- Real deployment feasibility unclear
Insufficient Case Analysis:
- No specific translation examples provided
- Difficult to intuitively understand model behavior
- Error type analysis missing
Insufficient Generalization Verification:
- Only one LLM (GPT-4o)
- Only technical and business domains
- Other open-source LLMs (Llama, Mistral) not tested

Impact

Contribution to Field:
- Provides new paradigm for terminology-constrained translation
- Dual-stage architecture may inspire subsequent research
- "Editor vs. generator" insight has theoretical value
Practical Value:
- Moderate: Method depends on GPT-4o, cost may limit large-scale application
- But approach transferable to open-source models
- Synthetic data generation pipeline has practical value
Reproducibility:
- Good: Provides detailed prompt templates
- Clear method description
- But GPT-4o dependency may affect complete reproduction
Value for Subsequent Research:
- Provides baseline for WMT 2025 task
- Flexible constraint strategy worth deeper exploration
- Document-level extension is natural next step

Applicable Scenarios

Most Suitable:
- Technical document translation (IT, finance)
- Scenarios with clear terminology dictionaries
- Applications requiring high terminology consistency but tolerating some cost
Potentially Suitable:
- Business contract translation
- Product manual localization
- Enterprise internal document translation
Less Suitable:
- Real-time translation (cost and latency)
- Resource-constrained environments (depends on large LLMs)
- Literary translation (over-constraint may damage creativity)
- Highly specialized domains (medicine, law require more domain validation)
Potentially Suitable After Improvements:
- After replacing GPT-4o with open-source LLMs: Low-cost scenarios
- After extending to document level: Long document translation
- After adding human interaction: CAT tool integration

References

Key Citations

Dinu et al., 2019: Training neural machine translation to apply terminology constraints - Representative work on training-time tag methods
Raunak et al., 2023: Leveraging GPT-4 for automatic translation post-editing - Direct inspiration for LLM post-editing
Team et al., 2022: NLLB-200 - Base multilingual NMT model used in this work
Moslem et al., 2023: Domain terminology integration into machine translation - Related work on domain terminology integration
Zhang et al., 2023: Understanding and improving the robustness of terminology constraints - Recent advances in inference-time constraint methods
Rei et al., 2022: CometKiwi/COMETQE - Quality assessment metrics used in this work
Vaswani et al., 2023: Attention is all you need - Transformer architecture foundation

Overall Evaluation

DuTerm is a well-engineered, clearly reasoned applied research paper. Its core contribution is a practical dual-stage architecture that combines the strengths of NMT and LLMs for terminology-constrained translation. The insight that "LLMs are more effective as editors than as generators" is broadly applicable and may influence future hybrid translation system design.

However, the paper has shortcomings in experimental depth (lacking comparisons with other methods, human evaluation) and generalization verification (single LLM, limited domains). Additionally, dependence on GPT-4o may limit applicability in resource-constrained scenarios.

Overall, this is a solid shared task participation paper that provides valuable methods and insights, but requires more follow-up work to validate effectiveness and practicality in broader scenarios. For researchers working on machine translation, particularly terminology-constrained translation, the dual-stage approach and synthetic data generation pipeline provided in this paper have reference value.