A fully automated and scalable Parallel Data Augmentation for Low Resource Languages using Image and Text Analytics

Sharma, Goyal, Goyal et al.
Linguistic diversity across the world creates a disparity with the availability of good quality digital language resources thereby restricting the technological benefits to majority of human population. The lack or absence of data resources makes it difficult to perform NLP tasks for low-resource languages. This paper presents a novel scalable and fully automated methodology to extract bilingual parallel corpora from newspaper articles using image and text analytics. We validate our approach by building parallel data corpus for two different language combinations and demonstrate the value of this dataset through a downstream task of machine translation and improve over the current baseline by close to 3 BLEU points.
A Fully Automated and Scalable Parallel Data Augmentation for Low Resource Languages using Image and Text Analytics

Basic Information

  • Paper ID: 2510.13211
  • Title: A fully automated and scalable Parallel Data Augmentation for Low Resource Languages using Image and Text Analytics
  • Authors: Prawaal Sharma (Infosys), Navneet Goyal (BITS Pilani), Poonam Goyal (BITS Pilani), Vishnupriyan K R (Infosys)
  • Classification: cs.CL (Computation and Language)
  • Conference: SAC '23 (The 38th ACM/SIGAPP Symposium on Applied Computing), March 27-31, 2023, Tallinn, Estonia
  • Paper Link: https://arxiv.org/abs/2510.13211

Abstract

Global linguistic diversity has created disparities in the availability of high-quality digital language resources, thereby limiting technological advantages for most populations. The lack or absence of data resources makes it difficult to perform NLP tasks for low-resource languages. This paper proposes a novel, scalable, and fully automated approach to extract bilingual parallel corpora from newspaper articles using image and text analytics. The authors validate the approach by constructing parallel data corpora for two different language pairs and demonstrate the value of the dataset through machine translation downstream tasks, achieving approximately 3 BLEU points improvement over current baselines.

Research Background and Motivation

Problem Definition

  1. Core Issue: Of the 7,000 languages globally, only 20 have sufficient resources on the internet, with the remainder classified as low-resource languages (LRLs), lacking digital data support
  2. Scope of Impact: Over 2.5 billion people use 2,000 low-resource languages, primarily distributed in India and Africa
  3. Technical Barriers: Modern NLP tasks require large amounts of training data, and the scarcity of digital data in low-resource languages is the main obstacle to making NLP technology widely available to their speakers

Research Motivation

  • Construct parallel corpora for low-resource languages, particularly for low-resource to high-resource language pairs
  • Select Konkani-Marathi as the primary example: Konkani is a typical low-resource language with scarce digital resources and fewer native speakers; Marathi is resource-rich
  • Observe that local newspapers from major publishers reuse images across different language versions to optimize resources

Core Contributions

  1. Innovative Methodology: Uses newspaper article images as a hub for article mapping, an approach the paper presents as unexplored in similar research
  2. Technical Breakthrough: Employ language-agnostic embeddings for sentence mapping on low-resource language pairs with empirical validation
  3. Dataset Contribution: Create the largest Konkani-Marathi corpus without manual annotation
  4. Generalizability Verification: Validate the language-agnostic nature of the method on the Punjabi-Hindi language pair

Methodology Details

Task Definition

  • Input: Newspaper PDF files in different languages
  • Output: Bilingual parallel sentence pair corpus
  • Constraints: Fully automated, no manual annotation required, language-agnostic

Model Architecture

The entire data augmentation pipeline comprises four core components:

1. Crawler Module

  • Download newspaper copies from online sources
  • Segment files into individual pages
  • Appropriately label with date, page number, and language code
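
The labeling step can be illustrated with a small sketch. The source does not give the naming scheme, so the label format, the directory layout, the `.png` extension, and the function names `page_label` and `page_path` below are assumptions made for illustration; the language codes are placeholders.

```python
from datetime import date
from pathlib import Path


def page_label(edition_date: date, page_number: int, lang_code: str) -> str:
    """Build a sortable label such as '2023-03-27_p004_kok' for one page."""
    if page_number < 1:
        raise ValueError("page numbers start at 1")
    return f"{edition_date.isoformat()}_p{page_number:03d}_{lang_code.lower()}"


def page_path(root: Path, edition_date: date, page_number: int, lang_code: str) -> Path:
    """Store pages under <root>/<lang>/<date>/ (hypothetical layout)."""
    label = page_label(edition_date, page_number, lang_code)
    return root / lang_code.lower() / edition_date.isoformat() / f"{label}.png"


if __name__ == "__main__":
    print(page_path(Path("pages"), date(2023, 3, 27), 4, "kok"))
```

Keeping the date in the label makes it straightforward to pair editions of the two languages published on the same day in later stages.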

2. Article Extractor

  • Functionality:
    • Mark individual article boundaries
    • Extract images and text within marked articles (using OCR)
  • Technical Implementation:
    • Use PRImA's layout analysis dataset for article boundary detection
    • Extract regions of interest (ROI) using OpenCV
    • Combine EasyOCR, PaddleOCR, and Tesseract with majority voting decision
  • Article Segmentation: Divide articles into four ROIs:
    • Title (H): including subtitles
    • Image (I)
    • Image Caption (P)
    • Content (C)
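
The majority-voting step over the three OCR engines can be sketched as follows. The source does not say at what granularity the vote is taken or how ties are broken, so this sketch assumes voting on the whole text of a region after whitespace normalization, with a fallback to the first engine when no two engines agree. The function name `majority_vote` is hypothetical, and the engine outputs are passed in as plain strings rather than obtained from the OCR libraries.

```python
from collections import Counter


def majority_vote(readings: list[str]) -> str:
    """Return the reading that most engines agree on for one region.

    Assumption: readings are compared after collapsing whitespace, and
    when no two engines agree the first engine's reading is kept.
    """
    if not readings:
        raise ValueError("at least one OCR reading is required")
    normalized = [" ".join(r.split()) for r in readings]
    best, best_count = Counter(normalized).most_common(1)[0]
    return best if best_count > 1 else normalized[0]


if __name__ == "__main__":
    # Three hypothetical engine outputs for the same caption region.
    print(majority_vote(["Goa  news today", "Goa news today", "Goa news tcday"]))
```

With three engines, any reading produced by two of them wins outright, which is presumably why an odd number of engines is convenient for this scheme.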

3. Article Mapper

  • Mapping Strategy: Compare article image similarity across two languages
  • Algorithm: Use SIFT (Scale-Invariant Feature Transform) as the image matching algorithm
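  • Mathematical Representation:

$$\{(a^{L_1}_1, a^{L_2}_1), (a^{L_1}_2, a^{L_2}_2), \ldots\} \equiv \theta(I^{L_1}_i, I^{L_2}_j)$$

where $\theta$ is the image matching function applied to image $I^{L_1}_i$ from language $L_1$ and image $I^{L_2}_j$ from language $L_2$, and each $(a^{L_1}_k, a^{L_2}_k)$ is a resulting pair of mapped articles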
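
A minimal sketch of how article pairs could be derived from image match scores is given below. It treats θ as a caller-supplied scoring function (in the paper's setting this would be SIFT-based, for example a count of good keypoint matches) and does not implement SIFT itself. The greedy one-to-one assignment, the `min_score` threshold, and the name `map_articles` are assumptions for illustration; the source does not describe how matches are selected.

```python
from typing import Any, Callable, Sequence


def map_articles(
    images_l1: Sequence[Any],
    images_l2: Sequence[Any],
    theta: Callable[[Any, Any], float],
    min_score: float,
) -> list[tuple[int, int]]:
    """Pair article indices one-to-one, best image match first."""
    scored = [
        (theta(a, b), i, j)
        for i, a in enumerate(images_l1)
        for j, b in enumerate(images_l2)
    ]
    scored.sort(key=lambda item: item[0], reverse=True)

    used_l1: set[int] = set()
    used_l2: set[int] = set()
    pairs: list[tuple[int, int]] = []
    for score, i, j in scored:
        if score < min_score:
            break  # all remaining candidates are weaker still
        if i in used_l1 or j in used_l2:
            continue  # each article takes part in at most one pair
        used_l1.add(i)
        used_l2.add(j)
        pairs.append((i, j))
    return sorted(pairs)


if __name__ == "__main__":
    # Toy scores standing in for keypoint-match counts between images.
    toy = {("k1", "m1"): 3, ("k1", "m2"): 85, ("k2", "m1"): 60, ("k2", "m2"): 5}
    print(map_articles(["k1", "k2"], ["m1", "m2"], lambda a, b: toy[(a, b)], 10))
```

Articles whose images find no sufficiently strong counterpart are simply left unmapped, which matches the idea that only articles sharing an image are paired.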

4. Sentence Mapper

  • Core Challenge: Sentences within mapped articles may not be sequentially ordered
  • Three Similarity Metrics:
    1. Language-Agnostic Sentence Embeddings (LAS): Based on BERT architecture, trained on 119 languages, using cosine similarity
    2. Simple Length Heuristic (SLAS): Based on sentence length and position within articles
    3. Lexical Overlap (LO): Using English as a pivot language with precision, recall, and F-Score
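
Two of these metrics can be sketched in a few lines. The first function aligns sentences by cosine similarity of their embeddings without assuming the two articles list sentences in the same order; the embeddings are taken as given vectors, and the `min_sim` threshold is an assumption. The second computes precision, recall, and F-score of token overlap between two sentences already translated into the pivot language; which side counts as the reference is also an assumption. The names `align_sentences` and `overlap_scores` are hypothetical.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def align_sentences(emb_l1, emb_l2, min_sim: float = 0.7):
    """For each L1 sentence, pick the most similar L2 sentence, in any position."""
    pairs = []
    for i, u in enumerate(emb_l1):
        sims = [cosine(u, v) for v in emb_l2]
        if not sims:
            continue
        j = max(range(len(sims)), key=sims.__getitem__)
        if sims[j] >= min_sim:
            pairs.append((i, j, sims[j]))
    return pairs


def overlap_scores(tokens_a: Sequence[str], tokens_b: Sequence[str]):
    """Precision, recall and F-score of the token-set overlap.

    Assumption: tokens_b plays the role of the reference sentence.
    """
    a, b = set(tokens_a), set(tokens_b)
    common = len(a & b)
    if common == 0:
        return 0.0, 0.0, 0.0
    precision = common / len(a)
    recall = common / len(b)
    return precision, recall, 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    l1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    l2 = [[0.0, 0.9, 0.1], [0.95, 0.05, 0.0]]
    print([(i, j) for i, j, _ in align_sentences(l1, l2)])
```

In the toy usage the first L1 sentence maps to the second L2 sentence and vice versa, illustrating why a purely positional alignment would fail on reordered articles.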

Technical Innovations

  1. Image Hub Strategy: Leverage the characteristic of newspapers reusing images across language versions, using images as reliable anchors for article mapping
  2. Multimodal Fusion: Combine image and text analysis to improve mapping accuracy
  3. Language Agnosticism: Use pre-trained multilingual models without customization for specific language pairs
  4. End-to-End Automation: Fully automated pipeline from raw PDFs to final parallel corpora

Experimental Setup

Datasets

  • Primary Language Pair: Konkani-Marathi
  • Validation Language Pair: Punjabi-Hindi
  • Data Source: Online newspaper PDF files
  • Time Span: Language versions are paired from editions published on the same date

Evaluation Metrics

  • Intrinsic Evaluation: Semantic Textual Similarity (STS), 6-level ordinal scoring (0-5)
    • 5: Complete semantic equivalence
    • 0: Complete semantic dissimilarity
  • Extrinsic Evaluation: BLEU scores on machine translation tasks
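
The intrinsic evaluation reduces to simple statistics over the human STS ratings. The sketch below assumes one integer rating per sentence pair on the 0-5 scale and reports the mean together with the share of pairs rated above a threshold; how the paper aggregates ratings across annotators is not stated, and the function name `summarize_sts` is hypothetical.

```python
from statistics import mean
from typing import Sequence


def summarize_sts(scores: Sequence[int], threshold: int = 3) -> dict[str, float]:
    """Mean STS rating and share of pairs rated strictly above `threshold`."""
    if not scores:
        raise ValueError("no ratings supplied")
    if any(s not in range(6) for s in scores):
        raise ValueError("STS ratings must be integers from 0 to 5")
    above = sum(1 for s in scores if s > threshold)
    return {"mean": mean(scores), "share_above": above / len(scores)}


if __name__ == "__main__":
    # Ten made-up ratings, not data from the paper.
    print(summarize_sts([5, 4, 4, 3, 2, 5, 4, 4, 1, 5]))
```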

Comparison Methods

  • Sentence mapping strategy comparison: LAS vs SLAS vs LO
  • Comparison with existing Konkani-Marathi baseline (BLEU=23.5)

Implementation Details

  • Human Evaluation: Two-stage sampling of 900 sentence pairs
  • First Stage: 200 pairs per sentence alignment strategy (600 total)
  • Second Stage: Additional 300 pairs for the best strategy
  • Sampling Strategy: Stratified random sampling without order preservation
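
Stratified random sampling of sentence pairs can be sketched as below. The source does not state which variable defines the strata, so this sketch assumes strata by source-sentence length in words (1-10, 11-19, 20+) and an equal quota per stratum; the final shuffle discards the original order. The names `length_bucket` and `stratified_sample` are hypothetical.

```python
import random
from collections import defaultdict


def length_bucket(n_words: int) -> str:
    """Assign a sentence to an assumed length stratum."""
    if n_words <= 10:
        return "1-10"
    if n_words <= 19:
        return "11-19"
    return "20+"


def stratified_sample(pairs, per_stratum: int, seed: int = 0):
    """Draw up to `per_stratum` (source, target) pairs from each stratum."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    strata = defaultdict(list)
    for pair in pairs:
        strata[length_bucket(len(pair[0].split()))].append(pair)

    sample = []
    for bucket in sorted(strata):
        members = strata[bucket]
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    rng.shuffle(sample)  # do not preserve the original sentence order
    return sample


if __name__ == "__main__":
    lengths = [5] * 10 + [15] * 10 + [25] * 10
    toy = [(" ".join(["w"] * n), f"t{k}") for k, n in enumerate(lengths)]
    print(len(stratified_sample(toy, per_stratum=4)))
```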

Experimental Results

Main Results

Intrinsic Evaluation Results

Sentence Length | Article Length | LAS | SLAS | LO
1-10 words      | 1-5 sentences  | 3.8 | 3.4  | 2.9
11-19 words     | 6-15 sentences | 3.7 | 3.4  | 3.0
20+ words       | 16+ sentences  | 3.8 | 3.2  | 2.6

Language Pair Comparison Results

Metric                   | Konkani-Marathi | Punjabi-Hindi
Mapped Articles          | 1,320           | 150
Mapped Sentence Pairs    | 14,448          | 2,200
Human Evaluation Samples | 600             | 100
Average STS Score        | 3.70            | 3.73

Key Findings

  1. LAS Optimal Performance: Language-Agnostic Sentence Embeddings (LAS) demonstrate superior performance across all sentence length and article length combinations
  2. High-Quality Mapping: Over 92% of mapped sentences achieve STS scores > 3
  3. Language Agnosticism: Punjabi-Hindi experimental results are comparable to the main experiment, validating the method's generalizability

Extrinsic Evaluation: Machine Translation Task

  • Model: Fine-tuned mT5 (multilingual pre-trained text-to-text transformer)
  • Training Data: Konkani-Marathi parallel corpus (titles and article content)
  • Test Data: Image captions as ground truth
  • Results: BLEU score of 26.4, approximately 3 BLEU points improvement over existing baseline (23.5)
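
The size of the gain follows directly from the two scores quoted above. The relative figure is derived here for orientation and is not stated in the source:

$$26.4 - 23.5 = 2.9 \text{ BLEU}, \qquad \frac{2.9}{23.5} \approx 12.3\%$$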

Ablation Study

Through comparison of different sentence mapping strategies, the study demonstrates:

  1. Language-agnostic embeddings significantly outperform length heuristics and lexical overlap methods
  2. The method maintains stable performance across different article and sentence lengths
  3. The embedding-based strategy for processing articles is effective

Related Work

Image Analysis Domain

  • Article Segmentation: Heuristic methods, graph embedding methods, deep learning methods
  • Image Matching: Traditional methods such as SIFT, SURF, BRIEF, and neural network methods like CNNs

Text Analysis Domain

  • OCR Technology: Extensive research targeting Devanagari script
  • Sentence Alignment: Length heuristics, lexical correspondence, and language-agnostic sentence embeddings based on deep learning

Konkani NLP Research

  • Existing Work: Primarily limited to fundamental tasks such as POS tagging, sentiment analysis, and NER
  • ILCI Project: Created a 25,000-sentence Hindi-Konkani corpus, achieving 23.5 BLEU score

Conclusions and Discussion

Main Conclusions

  1. The proposed method demonstrates language-agnostic properties and good scalability in constructing parallel corpora for low-resource languages
  2. The strategy of using images as article mapping hubs proves effective and innovative
  3. Language-agnostic sentence embeddings perform excellently in low-resource language sentence alignment tasks

Limitations

  1. Image Dependency: The method relies on shared images across language versions, limiting its applicability
  2. Quality Constraints: Additional constraints are needed to further improve dataset quality
  3. Domain Limitations: Currently validated primarily in the newspaper domain; applicability to other domains requires further verification

Future Directions

  1. Expand Image Sources: Consider images of the same news event captured by different individuals
  2. Quality Enhancement: Explore additional constraints to improve dataset quality
  3. Domain Extension: Apply the method to more text types and domains

In-Depth Evaluation

Strengths

  1. Strong Innovation: A novel approach that is the first to use images as a hub for cross-lingual article mapping
  2. High Practical Value: Provides a practical data augmentation method for low-resource language NLP research
  3. Complete Pipeline: Well-designed end-to-end system from data collection to final evaluation
  4. Sufficient Validation: Multi-perspective verification of method effectiveness through intrinsic and extrinsic evaluation
  5. Good Reproducibility: Detailed method description with well-justified technical choices

Weaknesses

  1. Limited Applicability: Heavily dependent on the specific scenario of newspapers sharing images across language versions
  2. Small Evaluation Scale: Relatively small human evaluation samples (600-900 sentence pairs)
  3. Insufficient Baseline Comparison: Lacks comparison with other automated parallel corpus construction methods
  4. Missing Error Analysis: Lacks in-depth analysis of failure cases and error patterns

Impact

  1. Academic Contribution: Provides new perspectives for parallel corpus construction in low-resource languages
  2. Practical Application: Can be directly applied to regions with multilingual newspapers
  3. Technology Promotion: The image hub strategy may inspire other multimodal NLP tasks

Applicable Scenarios

  1. Ideal Scenarios: Regions with multilingual newspapers and image sharing
  2. Extended Scenarios: Other media content with cross-lingual image sharing characteristics
  3. Limited Scenarios: Pure text or language pairs without image sharing

References

The paper cites 19 relevant references, covering:

  • Multilingual retrieval and personalization systems
  • Document layout analysis and image processing
  • Sentence alignment and parallel corpus construction
  • Low-resource language NLP research
  • Neural machine translation related work

Overall Assessment: This is an innovative work in parallel corpus construction for low-resource languages. Although the method applies only to a fairly specific scenario, it performs well within it. The proposed image hub strategy offers useful ideas for multimodal NLP research and helps advance the digitization of low-resource languages.