Hebrew Diacritics Restoration using Visual Representation

Elboher, Pinter

Diacritics restoration in Hebrew is a fundamental task for ensuring accurate word pronunciation and disambiguating textual meaning. Despite the language's high degree of ambiguity when unvocalized, recent machine learning approaches have significantly advanced performance on this task. In this work, we present DIVRIT, a novel system for Hebrew diacritization that frames the task as a zero-shot classification problem. Our approach operates at the word level, selecting the most appropriate diacritization pattern for each undiacritized word from a dynamically generated candidate set, conditioned on the surrounding textual context. A key innovation of DIVRIT is its use of a Hebrew Visual Language Model, which processes undiacritized text as an image, allowing diacritic information to be embedded directly within the input's vector representation. Through a comprehensive evaluation across various configurations, we demonstrate that the system effectively performs diacritization without relying on complex, explicit linguistic analysis. Notably, in an "oracle" setting where the correct diacritized form is guaranteed to be among the provided candidates, DIVRIT achieves a high level of accuracy. Furthermore, strategic architectural enhancements and optimized training methodologies yield significant improvements in the system's overall generalization capabilities. These findings highlight the promising potential of visual representations for accurate and automated Hebrew diacritization.

Basic Information

Paper ID: 2510.26521
Title: Hebrew Diacritics Restoration using Visual Representation
Authors: Yair Elboher, Yuval Pinter (Ben-Gurion University of the Negev)
Classification: cs.CL (Computational Linguistics)
Publication Date: November 3, 2025 (arXiv v2)
Paper Link: https://arxiv.org/abs/2510.26521v2

Abstract

Hebrew diacritics restoration is a fundamental task for ensuring accurate pronunciation and disambiguating text. Although undiacritized Hebrew exhibits high ambiguity, recent machine learning approaches have significantly improved performance on this task. This paper proposes DIVRIT, a novel system that reframes Hebrew diacritization as a zero-shot classification problem. The method operates at the word level, selecting the most appropriate diacritization pattern for each undiacritized word from a dynamically generated candidate set, conditioned on surrounding textual context. DIVRIT's key innovation is the use of a Hebrew visual language model that processes undiacritized text as images, enabling diacritization information to be directly embedded in the vector representations of inputs.

Research Background and Motivation

Problem Definition

Hebrew, as a representative Semitic language, primarily represents consonants. The absence of diacritical marks (niqqud) leads to severe lexical ambiguity. For example, the consonant string "mlk" can be interpreted as "king" (melekh), "reigned" (malakh), and other meanings depending on context.

Problem Significance

Practical Value: Automatic diacritization is crucial for digital text accessibility and human-computer interaction
Linguistic Complexity: Accurate diacritics restoration requires syntactic and semantic understanding
Technical Challenge: As a morphologically rich language, Hebrew diacritization rules are complex, requiring extraction of gender, tense, part-of-speech, and other information

Limitations of Existing Methods

Dicta's Nakdan: Combines deep learning with linguistic rules; high accuracy but limited generalization
Nakdimon: Pure data-driven character-level Bi-LSTM approach
MenakBERT: Transformer-based character-level pre-trained method

Existing systems primarily operate at the character level, whereas Hebrew morphology is primarily controlled by word-level templates, suggesting that word-level analysis is more suitable for this task.

Core Contributions

Novel Approach: Proposes the first word-level system that reframes Hebrew diacritization as a zero-shot classification problem
Visual Language Model: Develops a Hebrew visual language model based on Vision Transformer that learns diacritization patterns directly from images
Candidate Generation Mechanism: Designs a KNN-based candidate generation algorithm that dynamically generates diacritization candidates for each word
Performance: Achieves 92.68% word-level accuracy (WOR) in the Oracle setting and 87.87% in the KNN-based setting

Methodology Details

Task Definition

Input: Undiacritized Hebrew text
Output: Selection of the most appropriate diacritization pattern for each word
Constraint: Selection from dynamically generated candidate set, conditioned on context

Model Architecture

DIVRIT employs a dual-encoder architecture:

1. Candidate Encoder

Visual encoder based on PIXEL-base model
Processes diacritization candidates rendered as images
Generates candidate-specific embedding representations

2. Context Encoder

Uses ALEPHBERTGIMMEL-SMALL Hebrew language model
Extracts contextual embeddings of undiacritized words
Provides semantic and syntactic contextual information

3. Scoring Mechanism

Computes similarity between candidate embeddings and context embeddings via dot product:

score(candidate, context) = embedding_candidate · embedding_context
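
As a minimal sketch of this scoring step, the snippet below ranks candidates by their dot product with a context embedding. The vectors are toy values and the helper names (`dot`, `rank_candidates`) are hypothetical; in DIVRIT the embeddings would come from the candidate and context encoders, which are not reproduced here.

```python
from typing import Dict, List, Tuple


def dot(u: List[float], v: List[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    return sum(a * b for a, b in zip(u, v))


def rank_candidates(
    context: List[float], candidates: Dict[str, List[float]]
) -> List[Tuple[str, float]]:
    """Score every candidate against the context and sort best-first."""
    scored = [(name, dot(vec, context)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy 3-dimensional embeddings (illustrative values only, not model outputs).
context_embedding = [0.5, -1.0, 2.0]
candidate_embeddings = {
    "melekh": [1.0, 0.0, 1.0],  # 0.5 + 0.0 + 2.0 = 2.5
    "malakh": [0.0, 1.0, 0.5],  # 0.0 - 1.0 + 1.0 = 0.0
}
ranking = rank_candidates(context_embedding, candidate_embeddings)
best_candidate = ranking[0][0]
```

Because the score is an unnormalized dot product, the candidate set can change size from word to word without any change to the scoring function.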

Technical Innovations

1. Visual Representation Learning

Treats diacritics as visual elements, avoiding explicit vocabulary assignment
Uses masked image modeling objective to pre-train Hebrew PIXEL model
Performs additional pre-training on diacritized text, reducing masking ratio from 0.25 to 0.1

2. Candidate Generation Algorithm

KNN-based candidate generation mechanism:

Parameter k: number of similar words considered
Parameter c: maximum size of returned candidate set
Computes similarity based on character-level matching and position alignment
Leverages root-template morphological features of Semitic languages
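
The role of k and c can be illustrated with a simplified sketch. Everything below is an assumption made for illustration: the similarity function (a count of position-aligned matching letters between same-length words), the lexicon format, and the Latin transliterations standing in for Hebrew letters. The paper's actual algorithm may differ in all of these details.

```python
from typing import List, Tuple

# A lexicon entry pairs an undiacritized word with one vowel slot per letter.
Entry = Tuple[str, Tuple[str, ...]]


def similarity(a: str, b: str) -> int:
    """Count positions where two equal-length words share a letter."""
    if len(a) != len(b):
        return -1  # assumption: only same-length words can lend a pattern
    return sum(1 for x, y in zip(a, b) if x == y)


def generate_candidates(
    word: str, lexicon: List[Entry], k: int, c: int
) -> List[Tuple[str, ...]]:
    """Return up to c vowel patterns taken from the k most similar words."""
    usable = [e for e in lexicon if similarity(word, e[0]) >= 0]
    # sorted() is stable, so ties keep their lexicon order.
    ranked = sorted(usable, key=lambda e: similarity(word, e[0]), reverse=True)
    patterns: List[Tuple[str, ...]] = []
    for _, pattern in ranked[:k]:
        if pattern not in patterns:  # de-duplicate while keeping rank order
            patterns.append(pattern)
    return patterns[:c]


# Toy lexicon in Latin transliteration (hypothetical data).
lexicon = [
    ("mlk", ("e", "e", "")),  # melekh
    ("mlk", ("a", "a", "")),  # malakh
    ("ktb", ("a", "a", "")),  # katav
    ("dlt", ("e", "e", "")),  # delet
]
candidates = generate_candidates("yld", lexicon, k=3, c=2)
```

In this toy run the unseen word "yld" borrows the patterns of its neighbours, yielding the two templates that would vocalize it as "yeled" and "yalad". This is the sense in which shared root-template morphology lets patterns transfer between words.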

3. Zero-Shot Learning Framework

Each candidate serves as an independent class
Selects the most appropriate class through learning discriminative representations
Generalizes to unseen classes without task-specific training

Experimental Setup

Datasets

Pre-training Data:
- Hebrew Wikipedia: approximately 1.9GB
- OSCAR Hebrew portion: approximately 9.8GB
- Samples with fewer than 30 characters are filtered out
Diacritization Data:
- Gershuni and Pinter (2022) dataset
- Approximately 3.4 million tokens of original diacritized Hebrew text
- Includes Modern Hebrew, pre-Modern Hebrew, and automatically diacritized text
Test Set:
- 20K tokens from multiple Modern Hebrew sources

Evaluation Metrics

WOR: Word-level accuracy
CHA: Character-level accuracy
DEC: Diacritical decision accuracy
VOC: Word-level pronunciation preservation rate
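
A rough sketch of how WOR and CHA could be computed is given below. It assumes that gold and predicted words share the same base letters and differ only in their combining marks, and it counts every letter toward CHA; the exact counting rules of the benchmark may differ. DEC and VOC are omitted because they require per-decision and pronunciation-equivalence definitions not spelled out in this summary. The function names are hypothetical.

```python
import unicodedata
from typing import FrozenSet, List, Tuple


def split_marks(word: str) -> List[Tuple[str, FrozenSet[str]]]:
    """Group each base letter with the set of combining marks that follow it."""
    groups: List[Tuple[str, set]] = []
    for ch in word:
        if unicodedata.combining(ch) and groups:
            groups[-1][1].add(ch)  # Hebrew points have a nonzero combining class
        else:
            groups.append((ch, set()))
    return [(base, frozenset(marks)) for base, marks in groups]


def word_accuracy(gold: List[str], pred: List[str]) -> float:
    """WOR: share of words whose letters and marks all match."""
    hits = sum(1 for g, p in zip(gold, pred) if split_marks(g) == split_marks(p))
    return hits / len(gold)


def char_accuracy(gold: List[str], pred: List[str]) -> float:
    """CHA: share of letters carrying exactly the gold set of marks."""
    total = correct = 0
    for g, p in zip(gold, pred):
        for (_, g_marks), (_, p_marks) in zip(split_marks(g), split_marks(p)):
            total += 1
            correct += g_marks == p_marks
    return correct / total


# mem, lamed, final kaf with two different vocalizations.
MELEKH = "\u05de\u05b6\u05dc\u05b6\u05da\u05b0"
MALAKH = "\u05de\u05b8\u05dc\u05b7\u05da\u05b0"

gold = [MELEKH, MELEKH]
pred = [MELEKH, MALAKH]
wor = word_accuracy(gold, pred)  # one of two words is fully correct
cha = char_accuracy(gold, pred)  # four of six letters carry the right marks
```

The example shows why CHA is always at least as high as WOR: a single wrong vowel fails the whole word but only one of its letters.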

Comparison Methods

Baseline Methods: Majority class prediction baseline, KNN baseline
Data-Driven Systems: Nakdimon, MenakBERT
Hybrid System: Dicta's Nakdan

Implementation Details

Pre-training: 2M steps, batch size 128, 4 × 48GB Nvidia RTX6000 GPUs
Fine-tuning: 240K steps, batch size 32, 2 GPUs
Uses PangoCairo renderer and Noto Sans Hebrew font
All text images horizontally mirrored at instance level due to Hebrew's right-to-left writing direction

Experimental Results

Main Results

System	DEC	CHA	WOR	VOC
MAJORITY BASELINE	93.79	90.01	84.87	86.19
KNN BASELINE	96.20	94.09	87.09	87.39
NAKDIMON	97.91	96.37	89.75	91.64
MENAKBERT	98.82	97.95	94.12	95.22
DIVRIT (Oracle)	98.36	97.42	92.68	94.69
DIVRIT (KNN-based)	96.85	95.03	87.87	90.38
DICTA	98.94	98.23	95.83	95.93

Ablation Studies

1. Impact of Candidate Quantity

Two-candidate selection: 91.45% WOR accuracy
Three-candidate selection: 74.16% WOR accuracy
Increased candidate quantity leads to performance degradation, indicating insufficiencies in the scoring mechanism

2. Fine-tuning Duration

140K steps: 90.54% WOR accuracy
240K steps: 91.45% WOR accuracy
Extending fine-tuning from 140K to 240K steps raises WOR by 0.91 points

3. Auxiliary Tasks

Diacritics Bag Prediction Auxiliary Task:

L(w, C, c_gt) = CELoss(P(c | w), one_hot(c_gt))
                + (0.5 / N_cands) * Σ_i BCELoss(y_diac(c_i), y_target_diac(c_i))

Two candidates: improvement from 90.54% to 91.41%
Three candidates: degradation from 73.55% to 71.49%
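
The auxiliary objective above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' training code: the "diacritics bag" is modeled as a multi-hot vector over diacritic types, the per-candidate BCE is averaged over those types, and all function names are hypothetical.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over candidate scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def bce(pred: List[float], target: List[float], eps: float = 1e-12) -> float:
    """Mean binary cross-entropy over one candidate's diacritic-bag vector."""
    losses = []
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        losses.append(-(t * math.log(p) + (1.0 - t) * math.log(1.0 - p)))
    return sum(losses) / len(losses)


def combined_loss(
    scores: List[float],
    gt_index: int,
    bag_preds: List[List[float]],
    bag_targets: List[List[float]],
    weight: float = 0.5,
) -> float:
    """Cross-entropy over candidates plus the weighted, averaged bag loss."""
    ce = -math.log(softmax(scores)[gt_index])
    aux = sum(bce(p, t) for p, t in zip(bag_preds, bag_targets))
    return ce + weight / len(scores) * aux


# Two candidates with equal scores and maximally uncertain bag predictions.
loss = combined_loss(
    scores=[0.0, 0.0],
    gt_index=0,
    bag_preds=[[0.5, 0.5], [0.5, 0.5]],
    bag_targets=[[1.0, 0.0], [0.0, 1.0]],
)
```

In this toy case each term equals log 2, so the total is log 2 + (0.5 / 2) * 2 log 2 = 1.5 log 2. Dividing by the number of candidates keeps the auxiliary term on the same scale whether two or three candidates are scored.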

4. RTL Image Processing

Two candidates: 88.60% WOR accuracy
Three candidates: 84.93% WOR accuracy
With mirroring, the gap between the two- and three-candidate settings is only 3.67 points, indicating better generalization in multi-candidate scenarios

Experimental Findings

Validity of Visual Representation: DIVRIT demonstrates the potential of visual representations in Hebrew diacritization
Importance of Candidate Generation: Performance gap between Oracle and KNN settings highlights the importance of improving candidate generation
Generalization Challenge: Model generalization capability degrades with increasing candidate quantity
Context Encoder Selection: Text-based context encoder outperforms pure visual approaches

Related Work

Hebrew Diacritization Development

Hybrid Methods: Dicta's Nakdan combines deep learning with manual rules
Pure Data-Driven: Nakdimon uses Bi-LSTM, MenakBERT uses Transformer
Character-level vs. Word-level: Existing methods predominantly employ character-level prediction; this paper is the first to propose word-level candidate selection

Zero-Shot Learning

Success of large-scale language models like GPT-3 in multi-task zero-shot learning
Application of CLIP and ALIGN in vision-language zero-shot classification
First application of zero-shot learning to diacritization task

Vision-Language Models

Success of Vision Transformer in computer vision tasks
Robustness of PIXEL model in multilingual text processing
First application of ViT to candidate ranking task

Conclusions and Discussion

Main Conclusions

DIVRIT successfully reframes Hebrew diacritization as a zero-shot classification problem
Visual representations effectively capture diacritization patterns without requiring complex linguistic analysis
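Achieves performance competitive with existing methods in the Oracle setting (92.68% WOR, compared with 94.12% for MenakBERT and 95.83% for Dicta)
Word-level candidate selection matches Hebrew's template-based morphology, although character-level systems still lead in the reported results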

Limitations

Candidate Generation Dependency: System still relies on data-driven candidate generation methods
Context Encoder: Optimal configuration still employs text-based context encoder
Multi-Candidate Generalization: Performance significantly degrades with increasing candidate quantity
Language Specificity: Developed on Hebrew; application to other languages may face challenges

Future Directions

Improved Candidate Generation: Develop more precise candidate generation algorithms
Multilingual Extension: Apply methodology to other diacritics-rich languages such as Arabic and Vietnamese
Architecture Optimization: Explore larger-scale model architectures and extended pre-training processes
Multimodal Integration: Further optimize integration of visual and contextual information

In-Depth Evaluation

Strengths

Methodological Innovation: First to frame diacritization as a zero-shot classification problem, demonstrating originality
Technical Sophistication: Cleverly combines visual language models with traditional NLP methods
Comprehensive Experimentation: Conducts thorough ablation studies and architecture comparisons
Theoretical Contribution: Demonstrates validity of visual representations in morphological tasks

Weaknesses

Performance Gap: In the practical KNN-based setting, DIVRIT reaches 87.87% WOR, below Nakdimon (89.75%), MenakBERT (94.12%), and Dicta (95.83%)
Computational Complexity: Dual-encoder architecture may introduce additional computational overhead
Simple Candidate Generation: KNN-based method is relatively simple, potentially limiting system potential
Generalization Capability: Performance degradation in multi-candidate scenarios indicates limited model generalization

Impact

Domain Contribution: Provides new research paradigm for diacritization tasks
Technical Inspiration: Demonstrates potential of visual methods in NLP tasks
Practical Value: Provides new tool options for Hebrew text processing
Reproducibility: Commits to releasing code and data, facilitating subsequent research

Applicable Scenarios

Hebrew Text Processing: Digital libraries, educational software, etc.
Multilingual Systems: Extensible to other Semitic languages
Visual Text Processing: OCR post-processing, historical document digitization, etc.
Research Tools: Provides automated tools for linguistic research

References

The paper cites extensive related work, including:

Gershuni and Pinter (2022): Nakdimon system
Cohen et al. (2024): MenakBERT system
Shmidman et al. (2020): Dicta's Nakdan system
Rust et al. (2023): PIXEL model
He et al. (2022): Masked Autoencoder (MAE) pre-training for Vision Transformers

Overall Assessment: This is an innovative research paper that for the first time applies visual language models to Hebrew diacritization tasks and proposes a new zero-shot classification framework. Although performance in certain settings has not yet surpassed existing methods, its pioneering approach and comprehensive experimental validation provide valuable contributions and new research directions for the field.