Happiness is Sharing a Vocabulary: A Study of Transliteration Methods

Jung, Kim, Kim et al.
Transliteration has emerged as a promising means to bridge the gap between various languages in multilingual NLP, showing promising results especially for languages using non-Latin scripts. We investigate the degree to which shared script, overlapping token vocabularies, and shared phonology contribute to performance of multilingual models. To this end, we conduct controlled experiments using three kinds of transliteration (romanization, phonemic transcription, and substitution ciphers) as well as orthography. We evaluate each model on two downstream tasks -- named entity recognition (NER) and natural language inference (NLI) -- and find that romanization significantly outperforms other input types in 7 out of 8 evaluation settings, largely consistent with our hypothesis that it is the most effective approach. We further analyze how each factor contributed to the success, and suggest that having longer (subword) tokens shared with pre-trained languages leads to better utilization of the model.

Basic Information

  • Paper ID: 2510.10827
  • Title: Happiness is Sharing a Vocabulary: A Study of Transliteration Methods
  • Authors: Haeji Jung, Jinju Kim, Kyungjin Kim, Youjeong Roh, David R. Mortensen
  • Classification: cs.CL cs.AI
  • Publication Date: October 12, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2510.10827

Abstract

Transliteration has emerged as a promising approach to bridge linguistic gaps in multilingual NLP, demonstrating particular effectiveness for languages employing non-Latin scripts. This study investigates the degree to which shared scripts, overlapping vocabularies, and shared phonology contribute to multilingual model performance. Through controlled experiments employing three transliteration methods (romanization, phonemic transcription, and substitution cipher) alongside orthography, models are evaluated on two downstream tasks: Named Entity Recognition (NER) and Natural Language Inference (NLI). Results demonstrate that romanization significantly outperforms other input types in 7 of 8 evaluation settings, largely consistent with the authors' hypotheses. Further analysis reveals that sharing longer (subword) tokens with pretraining languages enables better utilization of model capacity.

Research Background and Motivation

Core Problem

The core problem addressed by this research is the script barrier phenomenon: multilingual models struggle to share knowledge across languages with different writing systems due to mismatched input representations.

Problem Significance

  1. Multilingual Fairness: Most pretrained language models are trained primarily on Latin-script text and provide insufficient support for non-Latin script languages
  2. Knowledge Transfer Barriers: Even in large-scale multilingual models, knowledge sharing across different writing systems remains challenging
  3. Resource Imbalance: Non-Latin script languages typically have fewer resources and require improved cross-lingual transfer methods

Limitations of Existing Approaches

  1. Lack of Systematic Analysis: Transliteration methods (e.g., romanization, phonemic conversion) are effective in practice, but the reasons for their effectiveness have not been investigated in depth
  2. Factor Confounding: Existing research does not clearly isolate the contributions of the different factors involved in transliteration
  3. Limited Evaluation Scope: Most studies focus on similar languages (e.g., within the Indo-European family) and lack typological diversity

Research Motivation

The authors pose a central question: Is it the shared script itself or the linguistic information encoded within the script that helps models adapt to other languages?

Core Contributions

  1. Theoretical Framework: Defines three key factors for transliteration effectiveness—shared character sets, shared token sets, and shared phonology
  2. Systematic Experiments: Conducts controlled pretraining experiments across four language sets and four input types
  3. In-depth Analysis: Reveals mechanisms through which different transliteration methods produce different overlap patterns via vocabulary overlap analysis
  4. Important Findings: Demonstrates the critical role of sharing longer tokens for cross-lingual adaptation and proposes the concept of vocabulary coverage

Methodology Details

Task Definition

The research objective is to understand how different factors in transliteration affect multilingual model performance on unseen languages. Input consists of text processed by different transliteration methods, with output being downstream task performance.

Three Key Factors

1. Shared Character Set

  • Definition: Transliteration maps languages onto a unified character set, reducing the number of unique characters and patterns the tokenizer must capture
  • Function: Significantly reduces the proportion of unknown (UNK) tokens
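
The effect of a shared character set on the UNK proportion can be illustrated with a minimal sketch. It assumes a toy character-level vocabulary and a hand-written romanized form; the names `unk_proportion` and `latin_vocab` are hypothetical and are not taken from the paper or from any tool it uses.

```python
def unk_proportion(tokens, vocab):
    """Fraction of tokens that are not in `vocab` and would map to UNK."""
    if not tokens:
        return 0.0
    unknown = sum(1 for tok in tokens if tok not in vocab)
    return unknown / len(tokens)


# Hypothetical character-level vocabulary covering lowercase Latin letters only.
latin_vocab = set("abcdefghijklmnopqrstuvwxyz ")

original = list("москва")   # Cyrillic characters, none of them in the vocabulary
romanized = list("moskva")  # hand-written romanized form (illustrative only)

print(unk_proportion(original, latin_vocab))   # 1.0
print(unk_proportion(romanized, latin_vocab))  # 0.0
```

In a real setup the tokens would come from the trained subword tokenizer rather than from single characters, but the ratio is computed the same way.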

2. Shared Token Set

  • Definition: Transliteration produces cross-lingual shared subword tokens (length > 1)
  • Importance: Character sequences are more likely to contain semantic information than individual characters

3. Shared Phonology

  • Definition: The degree to which transliteration methods encode phonological information
  • Function: Gives phonetically similar words similar representations, which helps the model identify cognates and loanwords

Four Input Types

Input Type                            | Shared Character Set | Shared Token Set | Shared Phonology
Ortho (Orthography)                   | -                    | -                | -
IPA (International Phonetic Alphabet) | ±                    | ±                | +
Rom (Romanization)                    | +                    | +                | ±
Cipher (Substitution Cipher)          | +                    | -                | -

IPA Conversion

  • Employs rule-based G2P conversion using the Epitran tool
  • Supports over 100 languages, ensuring consistency and practicality
  • Although based on Latin script, differences in phoneme inventories across languages result in partial sharing of character and token sets

Romanization (Rom)

  • Uses the Uroman tool to convert various scripts to Latin characters
  • Preserves original forms for Latin script languages
  • Encodes phonetic information but less precisely than IPA

Substitution Cipher

  • Applies a Caesar cipher to the romanized text
  • Uses a different shift for each language
  • Removes phonological information while keeping the shared character set
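
A minimal sketch of a per-language Caesar cipher over romanized text is shown below. The shift values and the language keys in `LANGUAGE_SHIFTS` are hypothetical placeholders, since the summary does not state which shifts the authors used.

```python
import string


def caesar_cipher(text: str, shift: int) -> str:
    """Shift each ASCII letter by `shift` positions, wrapping around the alphabet.

    Non-letter characters (spaces, digits, punctuation) are left unchanged.
    """
    lower = string.ascii_lowercase
    upper = string.ascii_uppercase
    k = shift % 26
    table = str.maketrans(
        lower + upper,
        lower[k:] + lower[:k] + upper[k:] + upper[:k],
    )
    return text.translate(table)


# Hypothetical per-language shifts (assumed values, not from the paper).
LANGUAGE_SHIFTS = {"lang_a": 3, "lang_b": 11}

romanized = "moskva"
print(caesar_cipher(romanized, LANGUAGE_SHIFTS["lang_a"]))  # prvnyd
print(caesar_cipher(romanized, LANGUAGE_SHIFTS["lang_b"]))  # xzdvgl
```

Both outputs still use only Latin letters, so the character set stays shared, while the same word no longer has the same surface form in the two languages.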

Language Selection Strategy

Based on lang2vec language similarity computation, four language sets are constructed:

  • sim-same: Similar languages + same script
  • sim-div: Similar languages + different scripts
  • dissim-same: Dissimilar languages + same script
  • dissim-div: Dissimilar languages + different scripts

Similarity integrates syntactic, geographic, genetic, and lexical features.

Experimental Setup

Datasets

  • Pretraining: Wikipedia corpus, limited to approximately 10 million words per language
  • Downstream Tasks:
    • NER: WikiAnn dataset
    • NLI: XNLI dataset

Model Configuration

  • Architecture: Transformer encoder based on XLM-R
  • Parameters: Approximately 109 million parameters
  • Vocabulary Size: 30K (SentencePiece BPE)
  • Training: Pretrains 16 models from scratch (4 input types × 4 language sets)
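
The 4 × 4 experimental grid can be enumerated as in the sketch below. The dictionary keys and the lowercase input-type labels are illustrative choices, not the authors' configuration format.

```python
from itertools import product

INPUT_TYPES = ["ortho", "ipa", "rom", "cipher"]
LANGUAGE_SETS = ["sim-same", "sim-div", "dissim-same", "dissim-div"]

# One pretraining run per (input type, language set) pair.
runs = [
    {"input_type": input_type, "language_set": language_set, "vocab_size": 30_000}
    for input_type, language_set in product(INPUT_TYPES, LANGUAGE_SETS)
]

print(len(runs))  # 16
print(runs[0])    # {'input_type': 'ortho', 'language_set': 'sim-same', 'vocab_size': 30000}
```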

Vocabulary Overlap Analysis

Overlap ratio:

$$\text{OverlapRatio}(l_t, L_s) = \max_{l \in L_s} \frac{|S_l \cap S_{l_t}|}{|S_{l_t}|}$$
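
The overlap ratio can be computed directly from token sets. The sketch below reads S_l as the set of subword tokens observed for language l, l_t as the target language, and L_s as the set of pretraining languages, which is an interpretation of the notation. The toy token sets and the function name `overlap_ratio` are hypothetical.

```python
def overlap_ratio(target_tokens, source_token_sets):
    """Largest share of the target token set covered by any single source language."""
    target = set(target_tokens)
    if not target or not source_token_sets:
        return 0.0
    return max(
        len(set(source) & target) / len(target)
        for source in source_token_sets.values()
    )


# Toy token sets (illustrative only).
target = {"a", "b", "ka", "mo", "va"}
sources = {
    "src1": {"a", "b", "ka"},               # shares 3 of 5 target tokens
    "src2": {"a", "mo", "va", "ka", "to"},  # shares 4 of 5 target tokens
}

print(overlap_ratio(target, sources))  # 0.8
```

Because of the max, the ratio reflects the single best-matching pretraining language rather than the union of all of them.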

Length-decomposed overlap ratio:

$$\frac{|\{x \in S_{l_s} \cap S_{l_t} \mid \text{len}(x) = m\}|}{|S_{l_t}|}$$
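
The length decomposition splits the overlap between one source language and the target language by token length m. A minimal sketch with toy token sets follows. The function name is hypothetical, and the handling of tokenizer-specific markers such as a word-boundary prefix is left out as a simplifying assumption.

```python
from collections import Counter


def overlap_by_length(source_tokens, target_tokens):
    """Map each token length m to the share of target tokens that are shared and have length m."""
    source, target = set(source_tokens), set(target_tokens)
    if not target:
        return {}
    counts = Counter(len(tok) for tok in source & target)
    return {m: count / len(target) for m, count in sorted(counts.items())}


# Toy token sets (illustrative only).
source = {"a", "ka", "mo", "sko", "to"}
target = {"a", "b", "ka", "mo", "sko"}

print(overlap_by_length(source, target))  # {1: 0.2, 2: 0.4, 3: 0.2}
```

Summing the values over all lengths gives the overlap between this source language and the target, so the decomposition shows how much of that overlap comes from single characters versus longer subwords.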

Experimental Results

Main Results

NER Task Performance

  • Unseen Languages: Rom significantly outperforms other methods across all language sets
  • Seen Languages: Rom performs comparably to Ortho
  • Statistical Significance: The differences between Rom and the other input types are significant at p < 0.05

NLI Task Performance

  • Unseen Languages: All transliteration methods outperform Ortho, with Rom performing best
  • Seen Languages: No significant differences between input types

Key Findings

  1. UNK Token Correlation: Strong negative correlation between UNK proportion in unseen languages and performance
  2. Transliteration Benefits: Primarily manifest in languages using unseen scripts
  3. Consistency: Rom performs best in 7/8 evaluation settings

In-depth Analysis

1. Role of Shared Character Sets

  • Transliteration dramatically reduces UNK proportion through unified character space
  • Cipher achieves significant gains from character sharing alone despite lacking semantic information
  • UNK proportion correlates negatively with F1 score

2. Importance of Token Length

Core Finding:

  • Short token overlap (including single characters) correlates negatively with performance
  • Long token overlap correlates positively with performance
  • Rom produces the largest number of long shared tokens, which explains its superior performance

Vocabulary Coverage Analysis:

  • Rom achieves highest coverage on tokens of length 2-4
  • Better use of the vocabulary space lets the model make fuller use of its capacity
  • Vocabulary coverage better explains performance differences than tokenizer fertility
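
The contrast between fertility and vocabulary coverage can be sketched as follows. Fertility uses its usual definition (average subword tokens per word). The coverage function is an assumed definition, namely the share of vocabulary entries of a given length that appear at least once in the tokenized text, and may differ from the paper's exact formulation. All names and data are hypothetical.

```python
def fertility(tokenized_words):
    """Average number of subword tokens per word."""
    if not tokenized_words:
        return 0.0
    return sum(len(pieces) for pieces in tokenized_words) / len(tokenized_words)


def vocabulary_coverage(tokenized_words, vocab, length=None):
    """ASSUMED definition: share of vocabulary entries that occur in the text.

    If `length` is given, only vocabulary entries with that many characters count.
    """
    candidates = {tok for tok in vocab if length is None or len(tok) == length}
    if not candidates:
        return 0.0
    used = {tok for pieces in tokenized_words for tok in pieces}
    return len(used & candidates) / len(candidates)


# Toy vocabulary and two hypothetical segmentations of the same word.
vocab = {"m", "o", "s", "k", "v", "a", "mo", "sk", "va", "ta"}
seg_long = [["mo", "sk", "va"]]
seg_short = [["m", "o", "s", "k", "v", "a"]]

print(fertility(seg_long), vocabulary_coverage(seg_long, vocab, length=2))    # 3.0 0.75
print(fertility(seg_short), vocabulary_coverage(seg_short, vocab, length=2))  # 6.0 0.0
```

Fertility only counts pieces per word, whereas the coverage measure shows which part of the vocabulary, broken down by token length, is actually being used.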

3. Mediating Role of Shared Phonology

  • Cipher, lacking phonological information, struggles to produce long tokens
  • IPA, despite more UNK tokens, produces longer shared tokens on unseen languages
  • Shared phonology promotes long token formation through consistent form-meaning mappings

Related Work

Script Barrier Research

  • Large-scale multilingual models have difficulty processing unseen or underrepresented scripts
  • Transliteration has gained attention as an effective means of improving cross-lingual transfer

Transliteration Methods

  • Romanization: Leverages the dominance of Latin scripts in pretrained models
  • G2P Conversion: Converts text to IPA phonemic representation
  • Existing Limitations: Prior work focuses primarily on similar languages and lacks analysis across typologically diverse languages

Vocabulary Overlap Research

  • Shared vocabulary/subword units allow models to reuse learned representations
  • High UNK token proportions impede transfer and reduce downstream performance
  • This study provides finer-grained analysis through length decomposition

Conclusions and Discussion

Main Conclusions

  1. Romanization Optimal: Significantly outperforms the other input types in 7 of 8 evaluation settings
  2. Long Tokens Critical: Sharing longer tokens is more important than character-level overlap
  3. Mechanism Explanation: Transliteration reshapes token distributions, making multilingual models more adaptive

Limitations

  1. Model Scope: Tests only one Transformer model and subword tokenization scheme
  2. Tool Dependency: Results may be affected by performance of specific romanization and G2P tools
  3. Evaluation Range: The findings may not carry over to character-level or byte-level models and would need validation there

Future Directions

  1. Extend to different model architectures and tokenization schemes
  2. Explore impacts of alternative transliteration tools
  3. Investigate effects of token length distribution on different tasks

In-depth Evaluation

Strengths

  1. Theoretical Contribution: First systematic decomposition of key factors in transliteration effectiveness
  2. Experimental Design: Rigorous controlled experimental design with clear variable control
  3. Analysis Depth: Length-decomposed vocabulary overlap analysis provides novel insights
  4. Practical Value: Provides guidance for transliteration method selection in multilingual NLP

Weaknesses

  1. Scope Limitation: Evaluated on only two tasks; generalizability requires verification
  2. Language Coverage: While typologically diverse, the number of languages is relatively limited
  3. Theoretical Explanation: The account of why longer tokens are more effective lacks depth

Impact

  1. Academic Contribution: Provides new analytical framework for transliteration research
  2. Practical Value: Guides multilingual model applications for low-resource languages
  3. Reproducibility: Detailed method and experimental setup descriptions facilitate reproduction

Applicable Scenarios

  1. Multilingual NLP: Particularly suitable for applications involving non-Latin scripts
  2. Low-Resource Languages: Provides effective transfer learning strategies for resource-scarce languages
  3. Cross-lingual Information Retrieval: Unified representations facilitate cross-lingual matching

References

The paper cites multiple important works, including:

  • XLM-R (Conneau et al., 2020): Multilingual pretrained model
  • Epitran (Mortensen et al., 2018): G2P conversion tool
  • Uroman (Hermjakob et al., 2018): Universal romanization tool
  • WikiAnn (Pan et al., 2017): Multilingual NER dataset

Through systematic controlled experiments and in-depth analysis, this research provides important insights into the mechanisms of transliteration in multilingual NLP. Particularly, the discovery of the critical role of shared long tokens for cross-lingual adaptation makes valuable contributions to both theoretical development and practical applications in the field.