
Hebrew Diacritics Restoration using Visual Representation

Elboher, Pinter
Diacritics restoration in Hebrew is a fundamental task for ensuring accurate word pronunciation and disambiguating textual meaning. Despite the language's high degree of ambiguity when unvocalized, recent machine learning approaches have significantly advanced performance on this task. In this work, we present DIVRIT, a novel system for Hebrew diacritization that frames the task as a zero-shot classification problem. Our approach operates at the word level, selecting the most appropriate diacritization pattern for each undiacritized word from a dynamically generated candidate set, conditioned on the surrounding textual context. A key innovation of DIVRIT is its use of a Hebrew Visual Language Model, which processes undiacritized text as an image, allowing diacritic information to be embedded directly within the input's vector representation. Through a comprehensive evaluation across various configurations, we demonstrate that the system effectively performs diacritization without relying on complex, explicit linguistic analysis. Notably, in an "oracle" setting where the correct diacritized form is guaranteed to be among the provided candidates, DIVRIT achieves a high level of accuracy. Furthermore, strategic architectural enhancements and optimized training methodologies yield significant improvements in the system's overall generalization capabilities. These findings highlight the promising potential of visual representations for accurate and automated Hebrew diacritization.

Basic Information

  • Paper ID: 2510.26521
  • Title: Hebrew Diacritics Restoration using Visual Representation
  • Authors: Yair Elboher, Yuval Pinter (Ben-Gurion University of the Negev)
  • Classification: cs.CL (Computational Linguistics)
  • Publication Date: November 3, 2025 (arXiv v2)
  • Paper Link: https://arxiv.org/abs/2510.26521v2

Abstract

Hebrew diacritics restoration is a fundamental task for ensuring accurate pronunciation and disambiguating text. Although undiacritized Hebrew exhibits high ambiguity, recent machine learning approaches have significantly improved performance on this task. This paper proposes DIVRIT, a novel system that reframes Hebrew diacritization as a zero-shot classification problem. The method operates at the word level, selecting the most appropriate diacritization pattern for each undiacritized word from a dynamically generated candidate set, conditioned on surrounding textual context. DIVRIT's key innovation is the use of a Hebrew visual language model that processes undiacritized text as images, enabling diacritization information to be directly embedded in the vector representations of inputs.

Research Background and Motivation

Problem Definition

Hebrew, like other Semitic languages, is written in a script that primarily represents consonants. Without diacritical marks (niqqud), written words become highly ambiguous. For example, the consonant string "mlk" can be read as "king" (melekh), "reigned" (malakh), or other words, depending on context.

Problem Significance

  1. Practical Value: Automatic diacritization is crucial for digital text accessibility and human-computer interaction
  2. Linguistic Complexity: Accurate diacritics restoration requires syntactic and semantic understanding
  3. Technical Challenge: As a morphologically rich language, Hebrew diacritization rules are complex, requiring extraction of gender, tense, part-of-speech, and other information

Limitations of Existing Methods

  1. Dicta's Nakdan: Combines deep learning with linguistic rules; high accuracy but limited generalization
  2. Nakdimon: Pure data-driven character-level Bi-LSTM approach
  3. MenakBERT: Transformer-based character-level pre-trained method

Existing systems operate mainly at the character level, whereas Hebrew morphology is largely governed by word-level templates, which suggests that word-level analysis is better suited to this task.

Core Contributions

  1. Novel Approach: Proposes the first word-level system that reframes Hebrew diacritization as a zero-shot classification problem
  2. Visual Language Model: Develops a Hebrew visual language model based on Vision Transformer that learns diacritization patterns directly from images
  3. Candidate Generation Mechanism: Designs a KNN-based candidate generation algorithm that dynamically generates diacritization candidates for each word
  4. Results: Achieves 92.68% word-level accuracy (WOR) in the Oracle setting and 87.87% in the KNN-based setting

Methodology Details

Task Definition

  • Input: Undiacritized Hebrew text
  • Output: Selection of the most appropriate diacritization pattern for each word
  • Constraint: Selection from a dynamically generated candidate set, conditioned on context

Model Architecture

DIVRIT employs a dual-encoder architecture:

1. Candidate Encoder

  • Visual encoder based on PIXEL-base model
  • Processes diacritization candidates rendered as images
  • Generates candidate-specific embedding representations

2. Context Encoder

  • Uses ALEPHBERTGIMMEL-SMALL Hebrew language model
  • Extracts contextual embeddings of undiacritized words
  • Provides semantic and syntactic contextual information

3. Scoring Mechanism

Computes similarity between candidate embeddings and context embeddings via dot product:

score(candidate, context) = embedding_candidate · embedding_context
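
The dot-product scoring above can be illustrated with a minimal sketch. The vectors below are made-up toy values, the function names are hypothetical, and the softmax normalization is an assumption (it is suggested by the P(c|w) term in the auxiliary loss, but the summary does not spell it out). This is not the authors' implementation.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_candidates(context_vec, candidate_vecs):
    """Score every candidate against the context and pick the best one.

    Returns (best_index, scores, probabilities). The probabilities come
    from a softmax over the scores, which is an assumed normalization.
    """
    scores = [dot(c, context_vec) for c in candidate_vecs]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # shift for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores, probs


# Toy embeddings (hypothetical values, 3 dimensions for readability).
context = [1.0, 0.0, 2.0]
candidates = [
    [0.5, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 0.0, 0.5],
]
best, scores, probs = rank_candidates(context, candidates)
```

Because the two encoders only meet at the dot product, candidate embeddings can in principle be computed independently of the context, which is the usual motivation for a dual-encoder design.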

Technical Innovations

1. Visual Representation Learning

  • Treats diacritics as visual elements, avoiding explicit vocabulary assignment
  • Uses masked image modeling objective to pre-train Hebrew PIXEL model
  • Performs additional pre-training on diacritized text, reducing masking ratio from 0.25 to 0.1
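
To make the masking ratio concrete, the following sketch selects which image patches are hidden during masked image modeling. It assumes uniform random masking over a hypothetical 100-patch input; the actual PIXEL masking scheme and patch count may differ, so treat this only as an illustration of what changing the ratio from 0.25 to 0.1 means.

```python
import random


def sample_patch_mask(num_patches, mask_ratio, seed=0):
    """Return a list of booleans, True where a patch is masked.

    Uniform random masking is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    num_masked = round(num_patches * mask_ratio)
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]


# Hypothetical input of 100 patches.
general_mask = sample_patch_mask(100, 0.25)      # ratio quoted for general pre-training
diacritized_mask = sample_patch_mask(100, 0.10)  # ratio quoted for diacritized text
```

With these settings, 25 of 100 patches are hidden at the higher ratio and 10 of 100 at the lower one.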

2. Candidate Generation Algorithm

KNN-based candidate generation mechanism:

  • Parameter k: number of similar words considered
  • Parameter c: maximum size of returned candidate set
  • Computes similarity based on character-level matching and position alignment
  • Leverages root-template morphological features of Semitic languages
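
The candidate generation idea can be sketched as follows. The similarity function (count of position-aligned matching letters between same-length words), the pattern-transfer step, and all function names are assumptions made for illustration; the summary does not give the exact algorithm, so this should not be read as the authors' method.

```python
import unicodedata


def split_marks(word):
    """Split a word into (base_letter, combining_marks) pairs."""
    units = []
    for ch in word:
        if unicodedata.category(ch) == "Mn" and units:
            base, marks = units[-1]
            units[-1] = (base, marks + ch)
        else:
            units.append((ch, ""))
    return units


def strip_marks(word):
    """Remove all combining marks (niqqud) from a word."""
    return "".join(base for base, _ in split_marks(word))


def positional_similarity(a, b):
    """Count matching letters at aligned positions; -1 if lengths differ."""
    if len(a) != len(b):
        return -1
    return sum(x == y for x, y in zip(a, b))


def generate_candidates(target, lexicon, k, c):
    """Copy diacritic patterns from the k most similar lexicon words
    onto the undiacritized target and return at most c candidates."""
    ranked = sorted(
        lexicon,
        key=lambda w: positional_similarity(target, strip_marks(w)),
        reverse=True,
    )
    candidates = []
    for neighbour in ranked[:k]:
        units = split_marks(neighbour)
        if len(units) != len(target):
            continue  # pattern cannot be aligned letter by letter
        candidate = "".join(t + marks for t, (_, marks) in zip(target, units))
        if candidate not in candidates:
            candidates.append(candidate)
    return candidates[:c]


MELEKH = "\u05de\u05b6\u05dc\u05b6\u05da\u05b0"  # melekh, "king"
MALAKH = "\u05de\u05b8\u05dc\u05b7\u05da\u05b0"  # malakh, "reigned"
DELET = "\u05d3\u05bc\u05b6\u05dc\u05b6\u05ea"   # delet, "door"
lexicon = [MELEKH, MALAKH, DELET]
target = "\u05de\u05dc\u05da"                    # undiacritized "mlk"
candidates = generate_candidates(target, lexicon, k=2, c=5)
```

In this toy lexicon the two nearest neighbours of "mlk" are its own diacritized forms, so the candidate set contains both readings and the scoring model is left to choose between them from context.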

3. Zero-Shot Learning Framework

  • Each candidate serves as an independent class
  • Selects the most appropriate class through learning discriminative representations
  • Generalizes to unseen classes without task-specific training

Experimental Setup

Datasets

  1. Pre-training Data:
    • Hebrew Wikipedia: approximately 1.9GB
    • OSCAR Hebrew portion: approximately 9.8GB
    • Filters samples with fewer than 30 characters
  2. Diacritization Data:
    • Gershuni and Pinter (2022) dataset
    • Approximately 3.4 million tokens of original diacritized Hebrew text
    • Includes Modern Hebrew, pre-Modern Hebrew, and automatically diacritized text
  3. Test Set:
    • 20K tokens from multiple Modern Hebrew sources

Evaluation Metrics

  • WOR: Word-level accuracy
  • CHA: Character-level accuracy
  • DEC: Diacritical decision accuracy
  • VOC: Word-level pronunciation preservation rate
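
As an illustration of the first two metrics, the sketch below computes word-level and character-level accuracy from aligned gold and predicted words. It assumes that a character counts as correct when its full set of diacritic marks matches the gold set, and it does not attempt DEC or VOC; the official evaluation scripts may differ in details such as which characters are counted.

```python
import unicodedata


def split_marks(word):
    """Split a word into (base_letter, frozenset_of_marks) pairs."""
    units = []
    for ch in word:
        if unicodedata.category(ch) == "Mn" and units:
            base, marks = units[-1]
            units[-1] = (base, marks | {ch})
        else:
            units.append((ch, frozenset()))
    return units


def word_and_char_accuracy(gold_words, pred_words):
    """Return (WOR, CHA) as fractions in [0, 1] for aligned word lists."""
    if len(gold_words) != len(pred_words):
        raise ValueError("gold and predicted word lists must be aligned")
    word_hits = char_hits = char_total = 0
    for gold, pred in zip(gold_words, pred_words):
        g, p = split_marks(gold), split_marks(pred)
        if [b for b, _ in g] != [b for b, _ in p]:
            raise ValueError("base letters differ between gold and prediction")
        hits = sum(gm == pm for (_, gm), (_, pm) in zip(g, p))
        char_hits += hits
        char_total += len(g)
        word_hits += hits == len(g)  # word is correct only if every letter is
    return word_hits / len(gold_words), char_hits / char_total


MELEKH = "\u05de\u05b6\u05dc\u05b6\u05da\u05b0"  # melekh, "king"
MALAKH = "\u05de\u05b8\u05dc\u05b7\u05da\u05b0"  # malakh, "reigned"
wor, cha = word_and_char_accuracy([MELEKH, MALAKH], [MELEKH, MELEKH])
```

Here one of two words is fully correct, and four of six letters carry the right marks, which shows why CHA is typically higher than WOR for the same predictions.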

Comparison Methods

  • Baseline Methods: Majority class prediction baseline, KNN baseline
  • Data-Driven Systems: Nakdimon, MenakBERT
  • Hybrid System: Dicta's Nakdan

Implementation Details

  • Pre-training: 2M steps, batch size 128, 4 × 48GB Nvidia RTX6000 GPUs
  • Fine-tuning: 240K steps, batch size 32, 2 GPUs
  • Uses PangoCairo renderer and Noto Sans Hebrew font
  • All text images horizontally mirrored at instance level due to Hebrew's right-to-left writing direction
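
The mirroring step can be pictured with a tiny example. The sketch assumes the rendered image is available as a row-major grid of pixel values and flips the whole image at once; how the authors implement the flip inside their rendering pipeline is not stated here.

```python
def mirror_horizontally(image):
    """Return a copy of a row-major image with every row reversed."""
    return [list(reversed(row)) for row in image]


# Toy 3x3 "image" (hypothetical pixel values).
glyph = [
    [0, 0, 1],
    [0, 1, 1],
    [1, 1, 1],
]
mirrored = mirror_horizontally(glyph)
```

Flipping twice restores the original image, so the operation loses no information.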

Experimental Results

Main Results

All values are accuracy percentages.

System                DEC     CHA     WOR     VOC
MAJORITY BASELINE     93.79   90.01   84.87   86.19
KNN BASELINE          96.20   94.09   87.09   87.39
NAKDIMON              97.91   96.37   89.75   91.64
MENAKBERT             98.82   97.95   94.12   95.22
DIVRIT (Oracle)       98.36   97.42   92.68   94.69
DIVRIT (KNN-based)    96.85   95.03   87.87   90.38
DICTA                 98.94   98.23   95.83   95.93

Ablation Studies

1. Impact of Candidate Quantity

  • Two-candidate selection: 91.45% WOR accuracy
  • Three-candidate selection: 74.16% WOR accuracy
  • Increased candidate quantity leads to performance degradation, indicating insufficiencies in the scoring mechanism

2. Fine-tuning Duration

  • 140K steps: 90.54% WOR accuracy
  • 240K steps: 91.45% WOR accuracy
  • Extending fine-tuning from 140K to 240K steps improves WOR accuracy by 0.91 points

3. Auxiliary Tasks

Diacritics Bag Prediction Auxiliary Task:

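\[
L(w, C, c_{gt}) = \mathrm{CELoss}\big(P(c \mid w),\ \mathrm{one\_hot}(c_{gt})\big) + \frac{0.5}{N_{cands}} \sum_{i=1}^{N_{cands}} \mathrm{BCELoss}\big(y_{diac}(c_i),\ y_{target\_diac}(c_i)\big)
\]
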
  • Two candidates: improvement from 90.54% to 91.41%
  • Three candidates: degradation from 73.55% to 71.49%
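
The combined objective can be sketched numerically. The sketch assumes that P(c|w) is a softmax over candidate scores and that each candidate's diacritics bag is a multi-hot vector whose binary cross-entropy is averaged over labels; both points, and all names below, are assumptions rather than details confirmed by the summary.

```python
import math


def cross_entropy(scores, gold_index):
    """Softmax cross-entropy of the gold candidate given raw scores."""
    top = max(scores)
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_z - scores[gold_index]


def bce(probs, targets, eps=1e-12):
    """Binary cross-entropy averaged over the labels of one candidate."""
    total = 0.0
    for p, t in zip(probs, targets):
        total -= t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
    return total / len(probs)


def combined_loss(scores, gold_index, bag_probs, bag_targets, aux_weight=0.5):
    """Selection loss plus the weighted diacritics-bag auxiliary loss."""
    n_cands = len(scores)
    aux = sum(bce(p, t) for p, t in zip(bag_probs, bag_targets))
    return cross_entropy(scores, gold_index) + aux_weight / n_cands * aux


# Two candidates with equal scores and maximally uncertain bag predictions.
loss = combined_loss(
    scores=[0.0, 0.0],
    gold_index=0,
    bag_probs=[[0.5, 0.5], [0.5, 0.5]],
    bag_targets=[[1, 0], [0, 1]],
)
```

In this degenerate case every term equals ln 2, so the total is 1.5 ln 2, which makes the relative weight of the auxiliary term easy to see.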

4. RTL Image Processing

  • Two candidates: 88.60% WOR accuracy
  • Three candidates: 84.93% WOR accuracy
  • Mirroring processing significantly improves generalization in multi-candidate scenarios

Experimental Findings

  1. Validity of Visual Representation: DIVRIT demonstrates the potential of visual representations in Hebrew diacritization
  2. Importance of Candidate Generation: Performance gap between Oracle and KNN settings highlights the importance of improving candidate generation
  3. Generalization Challenge: Model generalization capability degrades with increasing candidate quantity
  4. Context Encoder Selection: Text-based context encoder outperforms pure visual approaches

Related Work

Hebrew Diacritization Development

  1. Hybrid Methods: Dicta's Nakdan combines deep learning with manual rules
  2. Pure Data-Driven: Nakdimon uses Bi-LSTM, MenakBERT uses Transformer
  3. Character-level vs. Word-level: Existing methods predominantly employ character-level prediction; this paper is the first to propose word-level candidate selection

Zero-Shot Learning

  • Success of large-scale language models like GPT-3 in multi-task zero-shot learning
  • Application of CLIP and ALIGN in vision-language zero-shot classification
  • First application of zero-shot learning to diacritization task

Vision-Language Models

  • Success of Vision Transformer in computer vision tasks
  • Robustness of PIXEL model in multilingual text processing
  • First application of ViT to candidate ranking task

Conclusions and Discussion

Main Conclusions

  1. DIVRIT successfully reframes Hebrew diacritization as a zero-shot classification problem
  2. Visual representations effectively capture diacritization patterns without requiring complex linguistic analysis
  3. Achieves competitive performance in the Oracle setting (92.68% WOR, above Nakdimon's 89.75% but below MenakBERT's 94.12% and Dicta's 95.83%)
  4. Word-level approach is more suitable than character-level approaches for Hebrew diacritization

Limitations

  1. Candidate Generation Dependency: System still relies on data-driven candidate generation methods
  2. Context Encoder: Optimal configuration still employs text-based context encoder
  3. Multi-Candidate Generalization: Performance significantly degrades with increasing candidate quantity
  4. Language Specificity: Developed on Hebrew; application to other languages may face challenges

Future Directions

  1. Improved Candidate Generation: Develop more precise candidate generation algorithms
  2. Multilingual Extension: Apply methodology to other diacritics-rich languages such as Arabic and Vietnamese
  3. Architecture Optimization: Explore larger-scale model architectures and extended pre-training processes
  4. Multimodal Integration: Further optimize integration of visual and contextual information

In-Depth Evaluation

Strengths

  1. Methodological Innovation: First to frame diacritization as a zero-shot classification problem, demonstrating originality
  2. Technical Sophistication: Cleverly combines visual language models with traditional NLP methods
  3. Comprehensive Experimentation: Conducts thorough ablation studies and architecture comparisons
  4. Theoretical Contribution: Demonstrates validity of visual representations in morphological tasks

Weaknesses

  1. Performance Gap: Still does not surpass existing best methods in practical application scenarios
  2. Computational Complexity: Dual-encoder architecture may introduce additional computational overhead
  3. Simple Candidate Generation: KNN-based method is relatively simple, potentially limiting system potential
  4. Generalization Capability: Performance degradation in multi-candidate scenarios indicates limited model generalization

Impact

  1. Domain Contribution: Provides new research paradigm for diacritization tasks
  2. Technical Inspiration: Demonstrates potential of visual methods in NLP tasks
  3. Practical Value: Provides new tool options for Hebrew text processing
  4. Reproducibility: Commits to releasing code and data, facilitating subsequent research

Applicable Scenarios

  1. Hebrew Text Processing: Digital libraries, educational software, etc.
  2. Multilingual Systems: Extensible to other Semitic languages
  3. Visual Text Processing: OCR post-processing, historical document digitization, etc.
  4. Research Tools: Provides automated tools for linguistic research

References

The paper cites extensive related work, including:

  • Gershuni and Pinter (2022): Nakdimon system
  • Cohen et al. (2024): MenakBERT system
  • Shmidman et al. (2020): Dicta's Nakdan system
  • Rust et al. (2023): PIXEL model
  • He et al. (2022): Masked autoencoder (MAE) pre-training for Vision Transformers

Overall Assessment: This is an innovative research paper that applies visual language models to Hebrew diacritization for the first time and proposes a new zero-shot classification framework. Although its performance does not yet surpass the strongest existing systems (MenakBERT and Dicta's Nakdan), its pioneering approach and comprehensive experimental validation provide valuable contributions and new research directions for the field.