A fully automated and scalable Parallel Data Augmentation for Low Resource Languages using Image and Text Analytics

Sharma, Goyal, Goyal et al.

Linguistic diversity across the world creates a disparity with the availability of good quality digital language resources thereby restricting the technological benefits to majority of human population. The lack or absence of data resources makes it difficult to perform NLP tasks for low-resource languages. This paper presents a novel scalable and fully automated methodology to extract bilingual parallel corpora from newspaper articles using image and text analytics. We validate our approach by building parallel data corpus for two different language combinations and demonstrate the value of this dataset through a downstream task of machine translation and improve over the current baseline by close to 3 BLEU points.

Basic Information

Paper ID: 2510.13211
Title: A fully automated and scalable Parallel Data Augmentation for Low Resource Languages using Image and Text Analytics
Authors: Prawaal Sharma (Infosys), Navneet Goyal (BITS Pilani), Poonam Goyal (BITS Pilani), Vishnupriyan K R (Infosys)
Classification: cs.CL (Computational Linguistics)
Conference: SAC '23 (The 38th ACM/SIGAPP Symposium on Applied Computing), March 27-31, 2023, Tallinn, Estonia
Paper Link: https://arxiv.org/abs/2510.13211

Abstract

Global linguistic diversity has created disparities in the availability of high-quality digital language resources, which restricts the benefits of technology for much of the world's population. The lack or absence of data resources makes it difficult to perform NLP tasks for low-resource languages. This paper proposes a novel, scalable, and fully automated approach to extract bilingual parallel corpora from newspaper articles using image and text analytics. The authors validate the approach by constructing parallel corpora for two different language pairs and demonstrate the value of the dataset on a downstream machine translation task, where it improves on the current baseline by approximately 3 BLEU points.

Research Background and Motivation

Problem Definition

Core Issue: Of the 7,000 languages spoken globally, only 20 have sufficient resources on the internet; the rest are classified as low-resource languages (LRLs) and lack digital data support
Scope of Impact: Over 2.5 billion people use 2,000 low-resource languages, primarily distributed in India and Africa
Technical Barriers: Modern NLP tasks require large amounts of training data, and the scarcity of digital data in low-resource languages is the primary obstacle to making NLP technology widely available

Research Motivation

Construct parallel corpora for low-resource languages, particularly for low-resource to high-resource language pairs
Select Konkani-Marathi as the primary example: Konkani is a typical low-resource language with scarce digital resources and relatively few native speakers, while Marathi is comparatively resource-rich
Observe that local newspapers from major publishers reuse images across different language versions to optimize resources

Core Contributions

Innovative Methodology: First to use newspaper article images as a hub for article mapping, which has not been explored in similar research
Technical Breakthrough: Employ language-agnostic embeddings for sentence mapping on low-resource language pairs with empirical validation
Dataset Contribution: Create the largest Konkani-Marathi corpus without manual annotation
Generalizability Verification: Validate the language-agnostic nature of the method on the Punjabi-Hindi language pair

Methodology Details

Task Definition

Input: Newspaper PDF files in different languages
Output: Bilingual parallel sentence pair corpus
Constraints: Fully automated, no manual annotation required, language-agnostic

Model Architecture

The entire data augmentation pipeline comprises four core components:

1. Crawler Module

Download newspaper copies from online sources
Segment files into individual pages
Appropriately label with date, page number, and language code
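
The labelling step can be sketched as a small naming scheme. This is a minimal illustration, not the authors' implementation: the `PageLabel` class, the filename pattern, and the language codes are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class PageLabel:
    lang: str          # language code, e.g. "kok" or "mr" (example codes, assumed)
    issue_date: date   # publication date of the newspaper issue
    page: int          # 1-based page number within the issue

    def filename(self) -> str:
        # Hypothetical naming scheme: <lang>_<YYYY-MM-DD>_p<NNN>.pdf
        return f"{self.lang}_{self.issue_date.isoformat()}_p{self.page:03d}.pdf"


def parse_label(filename: str) -> PageLabel:
    """Recover the label from a filename produced by PageLabel.filename()."""
    stem = filename.rsplit(".", 1)[0]
    lang, day, page = stem.split("_")
    return PageLabel(lang, date.fromisoformat(day), int(page.lstrip("p")))
```

Encoding the date and page in the filename lets later stages find the same-date issue in the other language without a separate index.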

2. Article Extractor

Functionality:
- Mark individual article boundaries
- Extract images and text within marked articles (using OCR)
Technical Implementation:
- Use PRImA's layout analysis dataset for article boundary detection
- Extract regions of interest (ROI) using OpenCV
- Combine EasyOCR, PaddleOCR, and Tesseract with majority voting decision
Article Segmentation: Divide articles into four ROIs:
- Title (H): including subtitles
- Image (I)
- Image Caption (P)
- Content (C)
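
The majority-voting step over the three OCR engines might look like the sketch below. The summary does not state the voting granularity, so voting on whole lines, the whitespace normalization, and the tie-breaking rule are assumptions; the engine outputs are passed in as plain strings rather than produced by real OCR calls.

```python
from collections import Counter


def normalize(text: str) -> str:
    """Collapse runs of whitespace so trivial spacing differences do not split votes."""
    return " ".join(text.split())


def majority_vote(candidates: list[str]) -> str:
    """Return the reading most engines agree on.

    Ties (including the case where all engines disagree) fall back to the
    earliest candidate, so callers should list the most trusted engine first.
    """
    if not candidates:
        raise ValueError("need at least one OCR candidate")
    cleaned = [normalize(c) for c in candidates]
    counts = Counter(cleaned)
    best = max(counts.values())
    # The first candidate in input order that reaches the top count wins.
    return next(c for c in cleaned if counts[c] == best)


def vote_lines(engine_outputs: list[list[str]]) -> list[str]:
    """Vote line by line, assuming every engine returned the same number of lines."""
    return [majority_vote(list(lines)) for lines in zip(*engine_outputs)]
```

With three engines, a single engine's misreading is outvoted whenever the other two agree.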

3. Article Mapper

Mapping Strategy: Compare article image similarity across two languages
Algorithm: Use SIFT (Scale-Invariant Feature Transform) as the image matching algorithm
Mathematical Representation:

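$$\{(a^{L_1}_1, a^{L_2}_1), (a^{L_1}_2, a^{L_2}_2), \ldots\} \equiv \theta(I^{L_1}_i, I^{L_2}_j)$$

where θ is the image matching algorithm function, $I^{L_1}_i$ and $I^{L_2}_j$ are article images from the two languages, and each $(a^{L_1}_k, a^{L_2}_k)$ is a resulting pair of mapped articles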
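
One way to realize θ is sketched below: count descriptor matches that pass a nearest-neighbour ratio test, then pair articles greedily by match count. The descriptors are assumed to come from a SIFT extractor run upstream (for example OpenCV's); the ratio of 0.75, the `min_matches` cut-off, and the greedy one-to-one pairing are assumptions for illustration, not details taken from the paper.

```python
import math

Descriptor = tuple[float, ...]


def count_good_matches(desc_a: list[Descriptor], desc_b: list[Descriptor],
                       ratio: float = 0.75) -> int:
    """Count descriptors in desc_a whose nearest neighbour in desc_b passes the ratio test."""
    if len(desc_b) < 2:
        return 0
    good = 0
    for d in desc_a:
        nearest, second = sorted(math.dist(d, e) for e in desc_b)[:2]
        # Accept only matches that are clearly closer than the runner-up.
        if nearest < ratio * second:
            good += 1
    return good


def map_articles(images_l1: dict[str, list[Descriptor]],
                 images_l2: dict[str, list[Descriptor]],
                 min_matches: int = 2) -> list[tuple[str, str]]:
    """Greedily pair article ids across languages by descending match count."""
    scored = [
        (count_good_matches(da, db), id1, id2)
        for id1, da in images_l1.items()
        for id2, db in images_l2.items()
    ]
    pairs, used1, used2 = [], set(), set()
    for score, id1, id2 in sorted(scored, reverse=True):
        if score < min_matches:
            break  # remaining candidates are too weak to trust
        if id1 not in used1 and id2 not in used2:
            pairs.append((id1, id2))
            used1.add(id1)
            used2.add(id2)
    return pairs
```

The brute-force distance loop is quadratic in the number of descriptors; a real pipeline would likely use an indexed matcher, but the acceptance logic stays the same.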

4. Sentence Mapper

Core Challenge: Sentences within mapped articles may not be sequentially ordered
Three Similarity Metrics:
1. Language-Agnostic Sentence Embeddings (LAS): Based on BERT architecture, trained on 119 languages, using cosine similarity
2. Simple Length Heuristic (SLAS): Based on sentence length and position within articles
3. Lexical Overlap (LO): Using English as a pivot language with precision, recall, and F-Score
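
The embedding-based metric can be sketched as follows, assuming each sentence has already been encoded into a vector by a multilingual sentence encoder. Because sentences may appear in a different order in each language, every sentence is compared against all candidates in the paired article. The mutual-best-match rule and the 0.7 threshold are assumptions chosen for illustration.

```python
import math

Vector = list[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align_sentences(emb_l1: list[Vector], emb_l2: list[Vector],
                    threshold: float = 0.7) -> list[tuple[int, int, float]]:
    """Pair sentence indices that are each other's best match above the threshold."""
    if not emb_l1 or not emb_l2:
        return []
    sims = [[cosine(u, v) for v in emb_l2] for u in emb_l1]
    # Best L2 candidate for every L1 sentence, and vice versa.
    best_for_l1 = [max(range(len(emb_l2)), key=row.__getitem__) for row in sims]
    best_for_l2 = [max(range(len(emb_l1)), key=lambda i: sims[i][j])
                   for j in range(len(emb_l2))]
    pairs = []
    for i, j in enumerate(best_for_l1):
        if best_for_l2[j] == i and sims[i][j] >= threshold:
            pairs.append((i, j, sims[i][j]))
    return pairs
```

Requiring the match to be mutual discards sentences that exist in only one language version, at the cost of some recall.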

Technical Innovations

Image Hub Strategy: Leverage the characteristic of newspapers reusing images across language versions, using images as reliable anchors for article mapping
Multimodal Fusion: Combine image and text analysis to improve mapping accuracy
Language Agnosticism: Use pre-trained multilingual models without customization for specific language pairs
End-to-End Automation: Fully automated pipeline from raw PDFs to final parallel corpora

Experimental Setup

Datasets

Primary Language Pair: Konkani-Marathi
Validation Language Pair: Punjabi-Hindi
Data Source: Online newspaper PDF files
Time Span: Different language versions from the same date

Evaluation Metrics

Intrinsic Evaluation: Semantic Textual Similarity (STS), 6-level ordinal scoring (0-5)
- 5: Complete semantic equivalence
- 0: Complete semantic dissimilarity
Extrinsic Evaluation: BLEU scores on machine translation tasks
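
For readers unfamiliar with the extrinsic metric, here is a minimal corpus-level BLEU sketch with a single reference per hypothesis and no smoothing. It is not the scoring tool used in the paper, and reported scores depend on tokenization choices that this sketch leaves to the caller.

```python
import math
from collections import Counter


def ngrams(tokens: list[str], n: int) -> Counter:
    """Multiset of n-grams in a token list (empty if the list is shorter than n)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses: list[list[str]], references: list[list[str]],
                max_n: int = 4) -> float:
    """Corpus-level BLEU with one reference per hypothesis, on a 0-100 scale."""
    matched = [0] * max_n
    total = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
            # Clip each n-gram count by its count in the reference.
            matched[n - 1] += sum((hyp_counts & ref_counts).values())
            total[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matched) == 0:
        return 0.0  # no smoothing: any n-gram order without matches zeroes the score
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    # Brevity penalty punishes hypotheses shorter than their references.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)
```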

Comparison Methods

Sentence mapping strategy comparison: LAS vs SLAS vs LO
Comparison with existing Konkani-Marathi baseline (BLEU=23.5)
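
The lexical-overlap (LO) strategy in this comparison scores a candidate pair by token overlap after both sentences are rendered in the pivot language. A minimal sketch, assuming the pivot-language tokens are already available and treating the first sentence as the reference side (an arbitrary choice made here):

```python
from collections import Counter


def lexical_overlap(pivot_a: list[str], pivot_b: list[str]) -> tuple[float, float, float]:
    """Return (precision, recall, F1) of the token overlap between two pivot-language sentences."""
    # Multiset intersection counts each shared token at most as often as it appears on both sides.
    overlap = sum((Counter(pivot_a) & Counter(pivot_b)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(pivot_b)  # share of candidate tokens found in the reference
    recall = overlap / len(pivot_a)     # share of reference tokens found in the candidate
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Because the score depends on the quality of the pivot translation, errors introduced when translating into the pivot language carry over directly into the alignment.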

Implementation Details

Human Evaluation: Two-stage sampling of 900 sentence pairs
First Stage: 200 pairs per sentence alignment strategy (600 total)
Second Stage: Additional 300 pairs for the best strategy
Sampling Strategy: Stratified random sampling without order preservation
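
The sampling strategy can be sketched as below. Stratifying on the sentence-length bands from the intrinsic results table is an assumption, as are the function names and the fixed seed; the final shuffle reflects the "without order preservation" note.

```python
import random
from collections import defaultdict


def length_bucket(n_words: int) -> str:
    """Bucket boundaries mirror the sentence-length bands reported in the results."""
    if n_words <= 10:
        return "1-10"
    if n_words <= 19:
        return "11-19"
    return "20+"


def stratified_sample(pairs: list[tuple[str, str]], per_stratum: int,
                      seed: int = 0) -> list[tuple[str, str]]:
    """Draw up to per_stratum pairs from each length bucket, then shuffle the result."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for pair in pairs:
        strata[length_bucket(len(pair[0].split()))].append(pair)
    sample = []
    for bucket in sorted(strata):
        members = strata[bucket]
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    rng.shuffle(sample)  # discard the source ordering before annotation
    return sample
```

Fixing the seed makes the drawn sample reproducible, which helps when a second evaluation stage needs to avoid pairs already scored.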

Experimental Results

Main Results

Intrinsic Evaluation Results

Sentence Length	Article Length	LAS	SLAS	LO
1-10 words	1-5 sentences	3.8	3.4	2.9
11-19 words	6-15 sentences	3.7	3.4	3.0
20+ words	16+ sentences	3.8	3.2	2.6

Language Pair Comparison Results

Metric	Konkani-Marathi	Punjabi-Hindi
Mapped Articles	1,320	150
Mapped Sentence Pairs	14,448	2,200
Human Evaluation Samples	600	100
Average STS Score	3.70	3.73

Key Findings

LAS Optimal Performance: Language-Agnostic Sentence Embeddings (LAS) score highest in every sentence-length and article-length band (3.7 to 3.8), ahead of SLAS (3.2 to 3.4) and LO (2.6 to 3.0)
High-Quality Mapping: Over 92% of mapped sentences achieve STS scores > 3
Language Agnosticism: Punjabi-Hindi experimental results are comparable to the main experiment, validating the method's generalizability

Extrinsic Evaluation: Machine Translation Task

Model: Fine-tuned mT5 (multilingual pre-trained text-to-text transformer)
Training Data: Konkani-Marathi parallel corpus (titles and article content)
Test Data: Image captions as ground truth
Results: BLEU score of 26.4, approximately 3 BLEU points improvement over existing baseline (23.5)
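
The train/test separation described above can be sketched as a split on the ROI label of each sentence pair, using the H/P/C codes from the article segmentation. The `SentencePair` record and its field names are hypothetical; the fine-tuning itself is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SentencePair:
    konkani: str
    marathi: str
    roi: str  # "H" title, "P" image caption, "C" content


def split_by_roi(pairs: list[SentencePair]) -> tuple[list[SentencePair], list[SentencePair]]:
    """Train on title and content pairs; hold out caption pairs for testing."""
    train = [p for p in pairs if p.roi in {"H", "C"}]
    held_out = [p for p in pairs if p.roi == "P"]
    return train, held_out
```

Splitting by ROI type keeps caption text out of the training data, so the test set is never seen during fine-tuning.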

Ablation Study

Comparing the three sentence mapping strategies shows that:

Language-agnostic embeddings consistently outperform the length heuristic and lexical overlap methods
Performance remains stable across different article and sentence lengths
The embedding-based strategy is effective for processing mapped articles

Related Work

Image Analysis Domain

Article Segmentation: Heuristic methods, graph embedding methods, deep learning methods
Image Matching: Traditional methods such as SIFT, SURF, BRIEF, and neural network methods like CNNs

Text Analysis Domain

OCR Technology: Extensive research targeting Devanagari script
Sentence Alignment: Length heuristics, lexical correspondence, and language-agnostic sentence embeddings based on deep learning

Konkani NLP Research

Existing Work: Primarily limited to fundamental tasks such as POS tagging, sentiment analysis, and NER
ILCI Project: Created a 25,000-sentence Hindi-Konkani corpus, achieving 23.5 BLEU score

Conclusions and Discussion

Main Conclusions

The proposed method demonstrates language-agnostic properties and good scalability in constructing parallel corpora for low-resource languages
The strategy of using images as article mapping hubs proves effective and innovative
Language-agnostic sentence embeddings perform excellently in low-resource language sentence alignment tasks

Limitations

Image Dependency: The method relies on shared images across language versions, which limits it to publications that reuse images
Quality Constraints: Additional constraints are needed to further improve dataset quality
Domain Limitations: Currently validated primarily in the newspaper domain; applicability to other domains requires further verification

Future Directions

Expand Image Sources: Consider images of the same news event captured by different individuals
Quality Enhancement: Explore additional constraints to improve dataset quality
Domain Extension: Apply the method to more text types and domains

In-Depth Evaluation

Strengths

Strong Innovation: The first work to use images as a hub for cross-lingual article mapping
High Practical Value: Provides a practical data augmentation method for low-resource language NLP research
Complete Pipeline: A well-designed pipeline that runs from data collection through final evaluation
Sufficient Validation: Effectiveness is verified from multiple perspectives through both intrinsic and extrinsic evaluation
Good Reproducibility: Detailed method description with well-justified technical choices

Weaknesses

Limited Applicability: Heavily dependent on the specific scenario of newspapers sharing images across language versions
Small Evaluation Scale: Relatively small human evaluation samples (600-900 sentence pairs)
Insufficient Baseline Comparison: Lacks comparison with other automated parallel corpus construction methods
Missing Error Analysis: Lacks in-depth analysis of failure cases and error patterns

Impact

Academic Contribution: Provides new perspectives for parallel corpus construction in low-resource languages
Practical Application: Can be directly applied to regions with multilingual newspapers
Technology Promotion: The image hub strategy may inspire other multimodal NLP tasks

Applicable Scenarios

Ideal Scenarios: Regions with multilingual newspapers and image sharing
Extended Scenarios: Other media content with cross-lingual image sharing characteristics
Limited Scenarios: Text-only sources, or language pairs whose publications do not share images

References

The paper cites 19 relevant references, covering:

Multilingual retrieval and personalization systems
Document layout analysis and image processing
Sentence alignment and parallel corpus construction
Low-resource language NLP research
Neural machine translation related work

Overall Assessment: This is an innovative work in parallel corpus construction for low-resource languages. Although the method applies to a fairly specific setting, it performs well within it. The proposed image hub strategy offers useful ideas for multimodal NLP research and helps advance the digitization of low-resource languages.