A fully automated and scalable Parallel Data Augmentation for Low Resource Languages using Image and Text Analytics
Sharma, Goyal, Goyal et al.
Linguistic diversity across the world creates a disparity in the availability of good-quality digital language resources, thereby restricting the technological benefits available to the majority of the human population. The lack or absence of data resources makes it difficult to perform NLP tasks for low-resource languages. This paper presents a novel, scalable, and fully automated methodology to extract bilingual parallel corpora from newspaper articles using image and text analytics. We validate our approach by building a parallel data corpus for two different language combinations, demonstrate the value of this dataset through a downstream machine translation task, and improve over the current baseline by close to 3 BLEU points.
Global linguistic diversity has created disparities in the availability of high-quality digital language resources, limiting the technological benefits available to most of the world's population. The lack or absence of data resources makes it difficult to perform NLP tasks for low-resource languages. This paper proposes a novel, scalable, and fully automated approach to extract bilingual parallel corpora from newspaper articles using image and text analytics. The authors validate the approach by constructing parallel corpora for two different language pairs and demonstrate the value of the dataset on a downstream machine translation task, achieving an improvement of approximately 3 BLEU points over the current baseline.
Core Issue: Of the 7,000 languages globally, only 20 have sufficient resources on the internet, with the remainder classified as low-resource languages (LRLs), lacking digital data support
Scope of Impact: Over 2.5 billion people use 2,000 low-resource languages, primarily distributed in India and Africa
Technical Barriers: Modern NLP tasks require large amounts of training data, and the scarcity of digital data in low-resource languages is the primary obstacle to making NLP technology widely accessible
Construct parallel corpora for low-resource languages, particularly for low-resource to high-resource language pairs
Select Konkani-Marathi as the primary example: Konkani is a typical low-resource language with scarce digital resources and relatively few native speakers; Marathi is resource-rich
Observe that local newspapers from major publishers reuse images across different language versions to optimize resources
Input: Newspaper PDF files in different languages
Output: Bilingual parallel sentence pair corpus
Constraints: Fully automated, no manual annotation required, language-agnostic
Image Hub Strategy: Leverage the characteristic of newspapers reusing images across language versions, using images as reliable anchors for article mapping
Multimodal Fusion: Combine image and text analysis to improve mapping accuracy
Language Agnosticism: Use pre-trained multilingual models without customization for specific language pairs
End-to-End Automation: Fully automated pipeline from raw PDFs to final parallel corpora
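The image hub idea above can be illustrated with a minimal sketch. The summary does not say how images are compared, so the average hash and Hamming-distance threshold used here are stand-in assumptions, not the authors' method. Images are represented as small grayscale grids, and the names `average_hash`, `map_articles`, `max_dist`, and the article ids are hypothetical.

```python
from typing import Dict, List, Tuple

Image = List[List[int]]  # grayscale pixel grid, values 0-255 (toy stand-in for a real image)


def average_hash(img: Image) -> int:
    """Encode an image as a bit string: 1 where a pixel is brighter than the mean."""
    pixels = [p for row in img for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def map_articles(src: Dict[str, Image], tgt: Dict[str, Image],
                 max_dist: int = 2) -> List[Tuple[str, str]]:
    """Pair each source article with the target article whose image hash is closest.

    Assumption: one representative image per article; a pair is kept only if the
    hashes differ in at most `max_dist` bits.
    """
    tgt_hashes = {art_id: average_hash(img) for art_id, img in tgt.items()}
    pairs = []
    for s_id, s_img in src.items():
        s_hash = average_hash(s_img)
        best_id, best_d = None, max_dist + 1
        for t_id, t_hash in tgt_hashes.items():
            d = hamming(s_hash, t_hash)
            if d < best_d:
                best_id, best_d = t_id, d
        if best_id is not None:
            pairs.append((s_id, best_id))
    return pairs


def brighten(img: Image, delta: int = 5) -> Image:
    """Simulate a slightly different print or scan of the same photo."""
    return [[min(255, p + delta) for p in row] for row in img]


# Toy 4x4 "photos" with clearly different patterns.
photo_a = [[10, 200, 10, 200], [200, 10, 200, 10], [10, 200, 10, 200], [200, 10, 200, 10]]
photo_b = [[10, 10, 200, 200], [10, 10, 200, 200], [200, 200, 10, 10], [200, 200, 10, 10]]
photo_c = [[0, 0, 0, 0], [0, 0, 0, 0], [255, 255, 255, 255], [255, 255, 255, 255]]
photo_d = [[200, 200, 200, 200], [10, 10, 10, 10], [200, 200, 200, 200], [10, 10, 10, 10]]

konkani = {"kok_1": photo_a, "kok_2": photo_b, "kok_5": photo_d}
marathi = {"mar_3": brighten(photo_b), "mar_7": brighten(photo_a), "mar_9": brighten(photo_c)}

article_pairs = map_articles(konkani, marathi)
print(article_pairs)
```

In this toy data, `kok_5` has no counterpart image in the other edition and is left unmapped; in a fuller pipeline the text of each mapped article pair would then be passed on to sentence-level alignment.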
LAS Optimal Performance: Language-Agnostic Sentence Embeddings (LAS) demonstrate superior performance across all sentence length and article length combinations
High-Quality Mapping: Over 92% of mapped sentences achieve STS scores > 3
Language Agnosticism: Punjabi-Hindi experimental results are comparable to the main experiment, validating the method's generalizability
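To make the role of language-agnostic sentence embeddings concrete, the sketch below aligns sentences from two mapped articles by cosine similarity. It is an illustration under assumptions: the toy 3-dimensional vectors stand in for embeddings that a pre-trained multilingual encoder would produce, and the greedy one-to-one matching and the `min_sim` cutoff of 0.8 are hypothetical choices, not the procedure or threshold reported in the paper.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def align_sentences(src_vecs: List[Sequence[float]],
                    tgt_vecs: List[Sequence[float]],
                    min_sim: float = 0.8) -> List[Tuple[int, int, float]]:
    """Greedy one-to-one alignment: take the most similar pairs first,
    skip sentences that are already matched, stop below `min_sim`."""
    scored = sorted(
        ((cosine(s, t), i, j)
         for i, s in enumerate(src_vecs)
         for j, t in enumerate(tgt_vecs)),
        reverse=True,
    )
    used_src, used_tgt = set(), set()
    aligned = []
    for sim, i, j in scored:
        if sim < min_sim:
            break  # all remaining candidates are even less similar
        if i in used_src or j in used_tgt:
            continue
        used_src.add(i)
        used_tgt.add(j)
        aligned.append((i, j, sim))
    return sorted(aligned)


# Toy embeddings: in practice these would come from a multilingual sentence encoder.
src_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
tgt_embeddings = [[0.0, 0.9, 0.1], [0.95, 0.05, 0.0], [0.5, 0.5, 0.5]]

aligned = align_sentences(src_embeddings, tgt_embeddings)
for i, j, sim in aligned:
    print(f"source sentence {i} <-> target sentence {j} (cosine {sim:.3f})")
```

The third source sentence has no sufficiently similar target and is dropped, which mirrors the general idea of keeping only high-confidence pairs in the final corpus.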
Multilingual retrieval and personalization systems
Document layout analysis and image processing
Sentence alignment and parallel corpus construction
Low-resource language NLP research
Neural machine translation related work
Overall Assessment: This is innovative work on parallel corpus construction for low-resource languages. Although the method applies to a fairly specific scenario (publishers that reuse images across language editions), it performs well in that setting. The proposed image hub strategy offers useful insight for multimodal NLP research and contributes to the digitization of low-resource languages.