Happiness is Sharing a Vocabulary: A Study of Transliteration Methods

Jung, Kim, Kim et al.

Transliteration has emerged as a promising means to bridge the gap between various languages in multilingual NLP, showing promising results especially for languages using non-Latin scripts. We investigate the degree to which shared script, overlapping token vocabularies, and shared phonology contribute to performance of multilingual models. To this end, we conduct controlled experiments using three kinds of transliteration (romanization, phonemic transcription, and substitution ciphers) as well as orthography. We evaluate each model on two downstream tasks -- named entity recognition (NER) and natural language inference (NLI) -- and find that romanization significantly outperforms other input types in 7 out of 8 evaluation settings, largely consistent with our hypothesis that it is the most effective approach. We further analyze how each factor contributed to the success, and suggest that having longer (subword) tokens shared with pre-trained languages leads to better utilization of the model.

Basic Information

Paper ID: 2510.10827
Title: Happiness is Sharing a Vocabulary: A Study of Transliteration Methods
Authors: Haeji Jung, Jinju Kim, Kyungjin Kim, Youjeong Roh, David R. Mortensen
Classification: cs.CL cs.AI
Publication Date: October 12, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.10827

Abstract

Transliteration has emerged as a promising approach to bridge linguistic gaps in multilingual NLP, demonstrating particular effectiveness for languages employing non-Latin scripts. This study investigates the degree to which shared scripts, overlapping vocabularies, and shared phonology contribute to multilingual model performance. Through controlled experiments employing three transliteration methods (romanization, phonemic transcription, and substitution cipher) alongside orthography, models are evaluated on two downstream tasks: Named Entity Recognition (NER) and Natural Language Inference (NLI). Results demonstrate that romanization significantly outperforms other input types in 7 of 8 evaluation settings, largely consistent with the authors' hypotheses. Further analysis reveals that sharing longer (subword) tokens with pretraining languages enables better utilization of model capacity.

Research Background and Motivation

Core Problem

The core problem addressed by this research is the script barrier phenomenon: multilingual models struggle to share knowledge across languages with different writing systems due to mismatched input representations.

Problem Significance

Multilingual Fairness: Most pretrained language models are trained primarily on Latin-script text and therefore support non-Latin-script languages poorly
Knowledge Transfer Barriers: Even in large-scale multilingual models, knowledge sharing across different writing systems remains challenging
Resource Imbalance: Non-Latin-script languages typically have fewer resources and depend more heavily on effective cross-lingual transfer

Limitations of Existing Approaches

Lack of Systematic Analysis: While transliteration methods (e.g., romanization, phonemic conversion) are effective in practice, the reasons for their effectiveness have not been investigated in depth
Factor Confounding: Existing research does not clearly isolate the contributions of the different factors involved in transliteration
Limited Evaluation Scope: Most studies focus on similar languages (e.g., within the Indo-European family) and lack typological diversity

Research Motivation

The authors pose a central question: Is it the shared script itself or the linguistic information encoded within the script that helps models adapt to other languages?

Core Contributions

Theoretical Framework: Defines three key factors for transliteration effectiveness: shared character sets, shared token sets, and shared phonology
Systematic Experiments: Conducts controlled pretraining experiments across four language sets and four input types
In-depth Analysis: Uses vocabulary overlap analysis to show how different transliteration methods produce different overlap patterns
Important Findings: Demonstrates the critical role of sharing longer tokens for cross-lingual adaptation and proposes the concept of vocabulary coverage

Methodology Details

Task Definition

The research objective is to understand how different factors in transliteration affect multilingual model performance on unseen languages. The varied input is text processed by each transliteration method; the measured outcome is downstream task performance.

Three Key Factors

1. Shared Character Set

Definition: By mapping text into a unified character set, transliteration reduces the number of unique characters and patterns the tokenizer must capture
Function: Significantly reduces the proportion of unknown tokens (UNK)

2. Shared Token Set

Definition: Transliteration produces subword tokens longer than one character that are shared across languages
Importance: Character sequences are more likely to contain semantic information than individual characters

3. Shared Phonology

Definition: The degree to which transliteration methods encode phonological information
Function: Makes phonetically similar words have similar representations, enabling identification of cognates and loanwords
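
The UNK reduction attributed to a shared character set can be illustrated with a toy example. The sketch below is not the paper's tokenizer: it uses a hypothetical greedy longest-match segmenter over a tiny hand-written Latin-only vocabulary (both are assumptions made for illustration), and compares the UNK proportion of a Devanagari word with that of its romanized form.

```python
UNK = "<unk>"

def greedy_tokenize(text, vocab, max_len=8):
    """Greedy longest-match segmentation (a stand-in for a real subword
    tokenizer). Characters with no match in `vocab` become UNK."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate piece first, then shrink.
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if piece in vocab:
                tokens.append(piece)
                i += n
                break
        else:
            # No piece starting at position i is in the vocabulary.
            tokens.append(UNK)
            i += 1
    return tokens

def unk_proportion(tokens):
    """Fraction of tokens that are UNK."""
    return tokens.count(UNK) / len(tokens) if tokens else 0.0

# Hypothetical vocabulary learned only from Latin-script text.
vocab = {"na", "ma", "s", "t", "e", "n", "a", "m"}

ortho_tokens = greedy_tokenize("नमस्ते", vocab)   # original script
rom_tokens = greedy_tokenize("namaste", vocab)  # romanized form

print(unk_proportion(ortho_tokens), ortho_tokens)
print(unk_proportion(rom_tokens), rom_tokens)
```

In this toy setting every character of the original-script word falls outside the vocabulary, while the romanized form is segmented entirely into known pieces, including two multi-character tokens.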

Four Input Types

Input Type	Shared Character Set	Shared Token Set	Shared Phonology
Ortho (Orthography)	-	-	-
IPA (International Phonetic Alphabet)	±	±	+
Rom (Romanization)	+	+	±
Cipher (Substitution Cipher)	+	-	-

IPA Conversion

Employs rule-based G2P conversion using the Epitran tool
Supports over 100 languages, ensuring consistency and practicality
Although based on Latin script, differences in phoneme inventories across languages result in partial sharing of character and token sets

Romanization (Rom)

Uses the Uroman tool to convert various scripts to Latin characters
Preserves original forms for Latin script languages
Encodes phonetic information but less precisely than IPA

Substitution Cipher

Applies Caesar cipher to romanized text
Uses different shift rules for each language
Removes phonological information while maintaining character set sharing
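
A minimal sketch of the cipher condition, assuming a plain Caesar shift over the 26 Latin letters applied to already-romanized text. The per-language shift values and the language codes below are hypothetical; the summary does not state how the shifts were chosen.

```python
import string

ALPHABET = string.ascii_lowercase

def caesar_shift(text, shift):
    """Shift Latin letters by `shift` positions, preserving case.
    Non-letter characters (spaces, digits, punctuation) pass through."""
    k = shift % len(ALPHABET)
    shifted = ALPHABET[k:] + ALPHABET[:k]
    table = str.maketrans(ALPHABET + ALPHABET.upper(),
                          shifted + shifted.upper())
    return text.translate(table)

# Hypothetical per-language shifts (not taken from the paper).
LANGUAGE_SHIFTS = {"hin": 3, "kor": 7}

def encipher(romanized_text, lang):
    """Apply the language-specific shift to romanized text."""
    return caesar_shift(romanized_text, LANGUAGE_SHIFTS[lang])

print(encipher("namaste", "hin"))
print(encipher("namaste", "kor"))
```

Because each language uses a different shift, the same romanized string maps to different ciphertexts per language: the character set stays shared, but multi-character sequences no longer line up across languages.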

Language Selection Strategy

Four language sets are constructed from language similarities computed with lang2vec:

sim-same: Similar languages + same script
sim-div: Similar languages + different scripts
dissim-same: Dissimilar languages + same script
dissim-div: Dissimilar languages + different scripts

Similarity integrates syntactic, geographic, genetic, and lexical features.

Experimental Setup

Datasets

Pretraining: Wikipedia corpus, limited to approximately 10 million words per language
Downstream Tasks:
- NER: WikiAnn dataset
- NLI: XNLI dataset

Model Configuration

Architecture: Transformer encoder based on XLM-R
Parameters: Approximately 109 million parameters
Vocabulary Size: 30K (SentencePiece BPE)
Training: Pretrains 16 models from scratch (4 input types × 4 language sets)

Vocabulary Overlap Analysis

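Overlap ratio between a target language $l_t$ and a set of source languages $L_s$, where $S_l$ denotes the set of tokens for language $l$: $\text{OverlapRatio}(l_t, L_s) = \max_{l_s \in L_s} \frac{|S_{l_s} \cap S_{l_t}|}{|S_{l_t}|}$

Length-decomposed overlap ratio for a source language $l_s$ and token length $m$: $\frac{|\{x \in S_{l_s} \cap S_{l_t} \mid \text{len}(x) = m\}|}{|S_{l_t}|}$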
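
The two ratios above can be sketched directly with Python sets. This is an illustrative implementation under the assumption that each language's token set is available as a set of strings; the toy token sets and language names are hypothetical.

```python
from collections import Counter

def overlap_ratio(target_tokens, source_token_sets):
    """Max over source languages of |S_source ∩ S_target| / |S_target|."""
    if not target_tokens:
        return 0.0
    return max(
        len(source & target_tokens) / len(target_tokens)
        for source in source_token_sets.values()
    )

def overlap_by_length(target_tokens, source_tokens):
    """Length-decomposed overlap: for each token length m, the share of
    the target token set that is shared with the source and has length m."""
    counts = Counter(len(tok) for tok in source_tokens & target_tokens)
    return {m: c / len(target_tokens) for m, c in sorted(counts.items())}

# Hypothetical token sets for one target and two source languages.
target = {"a", "n", "na", "ma", "ste", "te"}
sources = {
    "lang_a": {"a", "n", "na", "ka"},
    "lang_b": {"a", "n", "na", "ma", "ste"},
}

print(overlap_ratio(target, sources))
print(overlap_by_length(target, sources["lang_b"]))
```

Summing the length-decomposed values for one source language recovers that language's overall overlap with the target, which is what makes the decomposition useful for separating single-character overlap from longer-token overlap.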

Experimental Results

Main Results

NER Task Performance

Unseen Languages: Rom significantly outperforms other methods across all language sets
Seen Languages: Rom performs comparably to Ortho
Statistical Significance: Differences between Rom and the other input types are significant at p < 0.05

NLI Task Performance

Unseen Languages: All transliteration methods outperform Ortho, with Rom performing best
Seen Languages: No significant differences between input types

Key Findings

UNK Token Correlation: Strong negative correlation between UNK proportion in unseen languages and performance
Transliteration Benefits: Primarily manifest in languages using unseen scripts
Consistency: Rom performs best in 7/8 evaluation settings

In-depth Analysis

1. Role of Shared Character Sets

Transliteration dramatically reduces UNK proportion through unified character space
Cipher achieves significant gains from character sharing alone despite lacking semantic information
UNK proportion correlates negatively with F1 score

2. Importance of Token Length

Core Finding:

Short token overlap (including single characters) correlates negatively with performance
Long token overlap correlates positively with performance
Rom produces the most long tokens, explaining its superior performance

Vocabulary Coverage Analysis:

Rom achieves highest coverage on tokens of length 2-4
Better utilization of the vocabulary space allows the model's capacity to be used more fully
Vocabulary coverage better explains performance differences than tokenizer fertility

3. Mediated Role of Shared Phonology

Cipher, lacking phonological information, struggles to produce long tokens
IPA, despite more UNK tokens, produces longer shared tokens on unseen languages
Shared phonology promotes long token formation through consistent form-meaning mappings
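
The contrast between tokenizer fertility and vocabulary coverage drawn in the analysis of token length can be made concrete with a small sketch. Fertility is computed here in its usual sense (average number of subword tokens per word). The coverage function is an assumption: this summary does not give the paper's exact definition, so the sketch takes coverage to mean the fraction of vocabulary entries of each length that occur at least once in the tokenized target text. Token length is counted in characters, ignoring any word-boundary marker a real tokenizer might add.

```python
from collections import defaultdict

def fertility(tokenized_words):
    """Average number of subword tokens per word."""
    if not tokenized_words:
        return 0.0
    return sum(len(word) for word in tokenized_words) / len(tokenized_words)

def coverage_by_length(vocab, tokenized_words):
    """Assumed definition: for each token length m, the fraction of
    vocabulary entries of length m that appear in the tokenized text."""
    used = {tok for word in tokenized_words for tok in word}
    total = defaultdict(int)
    hit = defaultdict(int)
    for tok in vocab:
        total[len(tok)] += 1
        if tok in used:
            hit[len(tok)] += 1
    return {m: hit[m] / total[m] for m in sorted(total)}

# Hypothetical vocabulary and tokenized target-language words.
vocab = {"a", "n", "s", "na", "ma", "ka", "ste"}
tokenized_words = [["na", "ma", "ste"], ["ma", "n"]]

print(fertility(tokenized_words))
print(coverage_by_length(vocab, tokenized_words))
```

Fertility only reflects how finely words are split, whereas the per-length coverage shows which part of the vocabulary the target language actually exercises, which is the distinction the analysis relies on.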

Related Work

Script Barrier Research

Large-scale multilingual models face challenges when processing unseen or underrepresented scripts
Transliteration has gained attention as an effective means of improving cross-lingual transfer

Transliteration Methods

Romanization: Leverages the dominance of Latin scripts in pretrained models
G2P Conversion: Converts text to IPA phonemic representation
Existing Limitations: Prior studies focus primarily on similar languages and lack analysis across typologically diverse languages

Vocabulary Overlap Research

Shared vocabulary/subword units allow models to reuse learned representations
High UNK token proportions impede transfer and reduce downstream performance
This study provides finer-grained analysis through length decomposition

Conclusions and Discussion

Main Conclusions

Romanization Optimal: Significantly outperforms the other input types in 7 of 8 evaluation settings
Long Tokens Critical: Sharing longer tokens is more important than character-level overlap
Mechanism Explanation: Transliteration reshapes token distributions, making multilingual models more adaptive

Limitations

Model Scope: Tests only one Transformer model and subword tokenization scheme
Tool Dependency: Results may be affected by performance of specific romanization and G2P tools
Evaluation Range: The findings may not carry over to character-level or byte-level models and would need separate validation there

Future Directions

Extend to different model architectures and tokenization schemes
Explore impacts of alternative transliteration tools
Investigate effects of token length distribution on different tasks

In-depth Evaluation

Strengths

Theoretical Contribution: First systematic decomposition of key factors in transliteration effectiveness
Experimental Design: Rigorous controlled experimental design with clear variable control
Analysis Depth: Length-decomposed vocabulary overlap analysis provides novel insights
Practical Value: Provides guidance for transliteration method selection in multilingual NLP

Weaknesses

Scope Limitation: Evaluated on only two tasks; generalizability requires verification
Language Coverage: While typologically diverse, the number of languages is relatively limited
Theoretical Explanation: The account of why longer shared tokens are more effective lacks depth

Impact

Academic Contribution: Provides new analytical framework for transliteration research
Practical Value: Guides multilingual model applications for low-resource languages
Reproducibility: Detailed method and experimental setup descriptions facilitate reproduction

Applicable Scenarios

Multilingual NLP: Particularly suitable for applications involving non-Latin scripts
Low-Resource Languages: Provides effective transfer learning strategies for resource-scarce languages
Cross-lingual Information Retrieval: Unified representations facilitate cross-lingual matching

References

The paper cites multiple important works, including:

XLM-R (Conneau et al., 2020): Multilingual pretrained model
Epitran (Mortensen et al., 2018): G2P conversion tool
Uroman (Hermjakob et al., 2018): Universal romanization tool
WikiAnn (Pan et al., 2017): Multilingual NER dataset

Through systematic controlled experiments and vocabulary overlap analysis, this research clarifies the mechanisms by which transliteration helps in multilingual NLP. In particular, the finding that sharing longer tokens is critical for cross-lingual adaptation is relevant to both theoretical development and practical applications in the field.