Language processing systems such as part-of-speech (POS) tagging, named entity recognition, machine translation, speech recognition, and language modeling (LM) are well studied for high-resource languages. Nevertheless, research on these systems for several low-resource languages, including Bodo, Mizo, Nagamese, and others, is either yet to commence or is in its nascent stages. Language models play a vital role in the downstream tasks of modern NLP, and extensive studies have been carried out on LMs for high-resource languages, yet languages such as Bodo, Rabha, and Mising continue to lack coverage. In this study, we first present BodoBERT, a language model for the Bodo language. To the best of our knowledge, this work is the first such effort to develop a language model for Bodo. Secondly, we present an ensemble DL-based POS tagging model for Bodo. The POS tagging model is based on combinations of BiLSTM with CRF and stacked embedding of BodoBERT with BytePairEmbeddings. We cover several language models in the experiment to see how well they work in POS tagging tasks. The best-performing model achieves an F1 score of 0.8041. A comparative experiment was also conducted on Assamese POS taggers, considering that the language is spoken in the same region as Bodo.
- Paper ID: 2401.03175
- Title: Part-of-Speech Tagger for Bodo Language using Deep Learning approach
- Authors: Dhrubajyoti Pathak, Sanjib Narzary, Sukumar Nandi, Bidisha Som
- Institution: Centre for Linguistic Science and Technology, IIT Guwahati
- Classification: cs.CL cs.AI cs.LG
- Published Journal: Natural Language Engineering (Accepted)
- Paper Link: https://arxiv.org/abs/2401.03175
This research addresses natural language processing for Bodo (Boro), a low-resource language. While NLP tasks such as part-of-speech tagging, named entity recognition, and machine translation have been extensively studied for high-resource languages, research on low-resource languages like Bodo, Mizo, and Nagamese remains in its infancy. This paper first proposes BodoBERT, the first pre-trained language model specifically designed for the Bodo language. Second, based on a BiLSTM-CRF architecture and stacked embeddings combining BodoBERT with BytePair embeddings, an ensemble deep learning POS tagging model is developed. The optimal model achieves an F1 score of 0.8041 on the Bodo language POS tagging task.
- Core Challenge: The Bodo language, spoken by 1.5 million people in northeastern India and ranked as India's 20th largest language, lacks fundamental NLP tools and resources
- Technical Obstacles:
  - Absence of pre-trained language models covering Bodo
  - Scarcity of annotated data (only ~30k annotated sentences available)
  - Complex linguistic characteristics (Tibeto-Burman language family with rich morphology)
- Linguistic Status: Bodo is one of 22 official languages of India and the official language of Bodoland Territorial Region
- Application Demand: 1.5 million speakers urgently require corresponding NLP tool support
- Academic Value: Fills a critical gap in low-resource language NLP research
- Fundamental NLP tasks for Bodo (morphological analysis, dependency parsing, language identification) remain unexplored
- No pre-trained language model is available for Bodo
- No deep learning-based downstream NLP tools exist for Bodo
- First Bodo Language Model: Proposes BodoBERT based on BERT architecture, the first pre-trained language model specifically trained for Bodo
- Multi-Architecture POS Tagger Comparison: Systematically compares three sequence labeling architectures: CRF, Fine-tuning, and BiLSTM-CRF
- Multi-Lingual Model Performance Analysis: Evaluates performance of FastText, BPE, XLM-R, FlairEmbedding, IndicBERT, MuRIL, and other language models on Bodo POS tagging
- Stacked Embedding Method: Evaluates both Individual and Stacked embedding approaches; the best stacked combination improves on the best individual embedding (F1 of 0.8041 vs. 0.7949)
- Open-Source Resources: Publicly releases the optimal POS tagger model and BodoBERT
Input: Bodo language sentence sequences
Output: POS labels for each word (34 labels based on BIS tagset)
Constraints: Uses Devanagari script, adheres to Indian language standards (BIS tagset)
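As a rough illustration of this input/output contract, the sketch below parses a two-column, CoNLL-style listing (one token and its tag per line, a blank line between sentences) into tagged sentences. The tokens `tok1` to `tok7` are placeholders rather than real Bodo words, the tags are BIS labels that appear elsewhere in this summary, and `parse_conll` is a hypothetical helper, not part of the released tagger.

```python
from typing import List, Tuple

Sentence = List[Tuple[str, str]]  # one (token, POS tag) pair per word

# Placeholder data: real input would be Bodo text in Devanagari script.
SAMPLE = """\
tok1 N_NNP
tok2 N_NN
tok3 V_VM
tok4 RD_PUNC

tok5 N_NN
tok6 V_VM
tok7 RD_PUNC
"""


def parse_conll(text: str) -> List[Sentence]:
    """Split column-formatted text into sentences of (token, tag) pairs."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        # First column is the token, last column is the tag.
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


sentences = parse_conll(SAMPLE)
tokens = [tok for tok, _ in sentences[0]]   # tagger input
labels = [tag for _, tag in sentences[0]]   # tagger output, one label per token
```

The tagger's job is to predict `labels` from `tokens`, so the two lists always have the same length.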
- Data Sources:
  - Linguistic Data Consortium for Indian Languages (LDC-IL)
  - Work by Narzary et al. (2022)
- Corpus Scale: 1.6M tokens, 191k sentences
- Domain Coverage: Aesthetics, business, mass media, technology, social sciences, and other domains
- Base Architecture: Multi-layer bidirectional Transformer (based on BERT framework)
- Key Parameters:
  - 6 Transformer blocks
  - Hidden layer dimension: 768
  - Number of self-attention heads: 6
  - Total parameters: ~103M
  - Vocabulary size: 50,000 (WordPiece tokenizer)
- Hardware: Nvidia Tesla P100 GPU
- Training Steps: 300K steps
- Sequence Length: 128
- Batch Size: 64
- Optimizer: Adam (learning rate 2e-5, 3000-step warm-up)
- Training Time: ~7 days
- CRF Model: BodoBERT embeddings + CRF layer
- Fine-tuning Model: Direct fine-tuning of BodoBERT for POS tagging
- BiLSTM-CRF Model: BodoBERT embeddings + BiLSTM + CRF layer
- Individual Method: Using individual language models separately
- Stacked Method: Combining BodoBERT with other language models
- Language Adaptation: First specialized language model designed for Bodo linguistic characteristics
- Multi-Model Fusion: Systematic comparison and fusion of multiple pre-trained models
- Cross-Lingual Transfer: Leveraging Hindi models sharing the same writing system (Devanagari)
- Stacking Strategy: Innovatively combining specialized language models with general-purpose models
- Annotated Corpus: Bodo Monolingual Text Corpus (ILCI-II)
- Data Scale:
  - Training set: 24,003 sentences, 192k tokens
  - Validation set: 2,325 sentences, 23k tokens
  - Test set: 3,161 sentences, 23k tokens
- Label System: BIS tagset with 11 top-level categories and 34 specific labels
- Data Format: CoNLL-2003 format
- Primary Metric: F1-score (Micro)
- Secondary Metrics: F1-score (Weighted), Precision, Recall
- Label-Level Analysis: Detailed performance for each POS label
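To make the difference between the micro and weighted F1 metrics concrete, here is a minimal sketch that computes both over token-level predictions. The gold and predicted tags are invented toy data, and the function names are illustrative; the source does not describe its evaluation code, so treat this as one plausible reading of the two metrics rather than the authors' implementation.

```python
from collections import Counter


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_and_weighted_f1(gold, pred):
    tp, fp, fn, support = Counter(), Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        support[g] += 1
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label gains a false positive
            fn[g] += 1  # gold label gains a false negative
    # Micro: pool the counts over all labels, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Weighted: per-label F1 averaged with gold-label frequency as weight.
    weighted = sum(
        support[label] * f1(tp[label], fp[label], fn[label]) for label in support
    ) / len(gold)
    return micro, weighted


# Toy example: the single proper noun is mislabeled as a common noun.
gold = ["N_NN", "N_NN", "N_NN", "N_NN", "V_VM", "N_NNP"]
pred = ["N_NN", "N_NN", "N_NN", "N_NN", "V_VM", "N_NN"]
micro, weighted = micro_and_weighted_f1(gold, pred)
```

With exactly one tag per token, micro F1 equals token accuracy (5 of 6 here), while weighted F1 is pulled down further because the missed `N_NNP` label scores zero on its own.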
| Model | Training Corpus | Data Volume |
|---|---|---|
| FastText | Wiki | <29M |
| BytePair | Wiki | 29M |
| BodoBERT | Bodo corpus | 1.6M |
| FlairEmbeddings | Wiki+OPUS | ≈29M |
| MuRIL | CommonCrawl+Wiki | 788M |
| XLM-R | CC-100 | 1.7B |
| IndicBERT | Scraping | 1.84B |
- CRF vs Fine-tuning vs BiLSTM-CRF
- Individual vs Stacked embedding methods
- Framework: Flair framework
- Batch Size: 32
- Early Stopping Strategy: Stop when validation performance plateaus
- Learning Rate Schedule: Learning Rate Annealing
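The early stopping and learning rate annealing settings above can be read as a plateau-based schedule: the learning rate is cut whenever the validation score stops improving, and training ends once the rate becomes too small. The sketch below simulates that logic in plain Python. The starting rate, annealing factor, patience, and minimum rate are illustrative assumptions; the source does not report these values.

```python
def anneal_schedule(dev_scores, lr=0.1, anneal_factor=0.5, patience=3, min_lr=1e-4):
    """Return (epoch, learning_rate) pairs for a sequence of validation scores.

    All default values are assumptions chosen for illustration only.
    """
    best = float("-inf")
    bad_epochs = 0
    history = []
    for epoch, score in enumerate(dev_scores, start=1):
        if score > best:
            best = score
            bad_epochs = 0
        else:
            bad_epochs += 1
        # After more than `patience` epochs without improvement, cut the rate.
        if bad_epochs > patience:
            lr *= anneal_factor
            bad_epochs = 0
        history.append((epoch, lr))
        # Stop early once the learning rate has become too small to matter.
        if lr < min_lr:
            break
    return history


# Hypothetical validation F1 scores that plateau after the third epoch.
dev_scores = [0.70, 0.75, 0.78, 0.78, 0.78, 0.78, 0.78]
history = anneal_schedule(dev_scores, lr=0.1, patience=2)
stopped = anneal_schedule(dev_scores, lr=0.1, patience=2, min_lr=0.06)
```

In the first call the rate is halved at epoch 6, after three epochs without improvement; in the second, the higher `min_lr` makes that same cut trigger the stop.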
| Embedding Method | Tagging Model | F1-score(Micro) | F1-score(Weighted) |
|---|---|---|---|
| BodoBERT | CRF | 0.7583 | 0.7454 |
| BodoBERT | Fine-tuned BERT | 0.7754 | 0.7775 |
| BodoBERT | BiLSTM + CRF | 0.7949 | 0.7898 |
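The CRF layer used in the first and third rows scores whole tag sequences instead of choosing each tag independently, which is typically decoded with the Viterbi algorithm. The following is a minimal sketch of that decoding step. The emission and transition scores are made-up numbers for two BIS tags, and `viterbi` is an illustrative helper rather than the paper's code.

```python
def viterbi(emissions, transitions, tags):
    """Return the highest-scoring tag sequence.

    emissions:   list of {tag: score}, one dict per token
    transitions: {(prev_tag, cur_tag): score}; missing pairs score 0.0
    """
    # best[i][tag] = (score of best path ending in tag at token i, previous tag)
    best = [{t: (emissions[0][t], None) for t in tags}]
    for i in range(1, len(emissions)):
        column = {}
        for cur in tags:
            prev, score = max(
                ((p, best[i - 1][p][0] + transitions.get((p, cur), 0.0)) for p in tags),
                key=lambda pair: pair[1],
            )
            column[cur] = (score + emissions[i][cur], prev)
        best.append(column)
    # Backtrack from the best final tag.
    last = max(tags, key=lambda t: best[-1][t][0])
    path = [last]
    for i in range(len(emissions) - 1, 0, -1):
        last = best[i][last][1]
        path.append(last)
    return path[::-1]


tags = ["N_NN", "V_VM"]
# Hypothetical per-token scores: taken alone, both tokens prefer V_VM.
emissions = [{"N_NN": 1.0, "V_VM": 1.1}, {"N_NN": 0.5, "V_VM": 1.0}]
# Hypothetical transition scores: noun-then-verb is rewarded, verb-then-verb penalized.
transitions = {("N_NN", "V_VM"): 1.0, ("V_VM", "V_VM"): -2.0}

greedy = [max(tags, key=lambda t: e[t]) for e in emissions]
decoded = viterbi(emissions, transitions, tags)
```

Per-token decoding yields two verbs in a row, while the sequence-level decoder uses the transition scores to prefer a noun followed by a verb. This joint scoring is the usual motivation for placing a CRF on top of a BiLSTM.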
| Embedding Model | Bodo F1 | Assamese F1 |
|---|---|---|
| FastText | 0.7686 | 0.6981 |
| BytePair | 0.7669 | 0.7099 |
| BodoBERT | 0.7949 | 0.7033 |
| FlairEmbeddings | 0.7885 | 0.7076 |
| MuRIL | 0.7708 | 0.7286 |
| XLM-R | 0.7638 | 0.7001 |
| IndicBERT | 0.7235 | 0.7293 |
| Stacked Embedding Combination | F1 Score |
|---|---|
| BodoBERT + FastText | 0.7928 |
| BodoBERT + BytePair | 0.8041 |
| BodoBERT + mBERT | 0.799 |
| BodoBERT + FlairEmbeddings | 0.801 |
| BodoBERT + MuRIL | 0.785 |
| BodoBERT + XLM-R | 0.8003 |
| BodoBERT + IndicBERT | 0.793 |
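Stacking, as used in the combinations above, amounts to concatenating the vectors that each embedding model produces for the same token, so the tagger sees both representations side by side. The sketch below shows the idea with two toy embedders. `contextual_stub` and `subword_stub` are hypothetical stand-ins with made-up dimensions, not BodoBERT or BytePair embeddings; in the Flair framework named in this summary, the `StackedEmbeddings` class performs this kind of concatenation.

```python
from typing import Callable, List, Sequence

Embedder = Callable[[str], List[float]]


def stack(embedders: Sequence[Embedder]) -> Embedder:
    """Build an embedder that concatenates the outputs of several embedders."""
    def embed(token: str) -> List[float]:
        vector: List[float] = []
        for embedder in embedders:
            vector.extend(embedder(token))
        return vector
    return embed


def contextual_stub(token: str) -> List[float]:
    # Stand-in for a language-specific model; returns a toy 4-dimensional vector.
    return [float(len(token)), 0.0, 1.0, 0.5]


def subword_stub(token: str) -> List[float]:
    # Stand-in for a subword embedding; returns a toy 2-dimensional vector.
    return [float(token.count("a")), 1.0]


stacked = stack([contextual_stub, subword_stub])
vector = stacked("tok")  # 4 + 2 = 6 dimensions
```

The stacked dimension is the sum of the component dimensions, so the downstream BiLSTM input layer has to be sized accordingly.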
By adding 10k automatically annotated and manually corrected sentences:
- Performance Improvement: F1 increased from 0.8041 to 0.8494, an absolute gain of 0.0453 (about 4.5 points) based on the two reported scores
- Validates Model Scalability
Best model performance on major POS labels:
- V_VM (Main verb): F1=0.9150 (highest among the word classes listed)
- RD_PUNC (Punctuation): F1=0.9944 (near perfect)
- N_NN (Noun): F1=0.7628 (largest class)
- N_NNP (Proper Noun): F1=0.6946 (more difficult)
Main error patterns discovered through confusion matrices:
- Intra-Class Confusion: Common nouns (N_NN) are confused with proper nouns (N_NNP) and locative nouns (N_NST)
- Part-of-Speech Conversion: Difficulty in tagging nouns used as adjectives
- Writing System Limitations: Unlike English, Bodo written in Devanagari has no capitalization to mark proper nouns
Bodo vs Assamese POS tagging results:
- Bodo Best: 0.8041 (BodoBERT+BytePair)
- Assamese Best: 0.7293 (IndicBERT)
- Rationale for the Gap: The label sets differ in complexity (34 labels for Bodo vs. 41 for Assamese)
- Assamese: Pathak et al. (2022, 2023) - BiLSTM-CRF achieves 86.52% F1
- Khasi: Warjri et al. (2021) - 96.98% accuracy
- Bengali: Alam et al. (2016) - 86.0% accuracy, Kabir et al. (2016) - 93.33% accuracy
- Mizo: Pandey et al. (2022) - LSTM achieves 81.86% accuracy
- Originality: First neural network-based POS tagger for Bodo language
- Systematicity: Comprehensive comparison of multiple architectures and language models
- Practicality: Provides open-source models and tools
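- BodoBERT Effectiveness: Among individual embeddings, the language-specific BodoBERT performs best on Bodo POS tagging (F1 of 0.7949)
- Architecture Advantages: With BodoBERT embeddings, BiLSTM-CRF (0.7949) outperforms both fine-tuning (0.7754) and CRF alone (0.7583)
- Stacking Strategy Effectiveness: The best stacked combination (BodoBERT + BytePair, 0.8041) outperforms the best single embedding, although not every combination improves on BodoBERT alone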
- Baseline Establishment: Establishes important baseline for Bodo language NLP research
- Data Scale: Annotated corpus is relatively small (30k sentences)
- Language Model Training Data: BodoBERT trained on only 1.6M tokens
- Performance Level: Still lags behind high-resource languages (F1 of 0.8041 vs. 0.90 or higher)
- Annotation Quality: Some annotations may require further correction
- Expand Corpus: Collect more Bodo language text and annotated data
- Model Improvement: Optimize BodoBERT architecture and training strategies
- Downstream Tasks: Extend to NER, syntactic parsing, and other NLP tasks
- Multilingual Modeling: Explore joint modeling with related languages
- Pioneering Contribution: First construction of language model and POS tagger for Bodo, filling an important gap
- Systematic Research: Comprehensive comparison of multiple methods with well-designed experiments
- Technical Innovation: Stacked embedding strategy effectively improves performance
- Practical Value: Open-source release of models provides foundational tools for the community
- Cross-Lingual Insights: Valuable cross-lingual analysis through Assamese comparison
- Data Limitations: Relatively small training data scale may affect model generalization
- Evaluation Limitations: Lacks comparison with traditional methods (HMM, rule-based approaches)
- Error Analysis Depth: Insufficient linguistic analysis of model failure cases
- Computational Resources: High model training costs may limit reproducibility
- Academic Value: Provides important paradigm for low-resource language NLP research
- Practical Significance: Directly serves the practical needs of the Bodo language community
- Methodological Contribution: Stacked embedding strategy generalizable to other low-resource languages
- Infrastructure: Establishes foundation for subsequent Bodo language NLP research
- Direct Application: Bodo language text processing, information extraction
- Research Foundation: Preprocessing step for other Bodo language NLP tasks
- Method Transfer: POS tagging tasks for similar low-resource languages
- Multilingual Systems: Component of multilingual NLP systems for northeastern India
The paper's key references include:
- BERT-related: Devlin et al. (2018) - Original BERT paper
- Sequence Labeling: Huang et al. (2015) - BiLSTM-CRF architecture
- Low-Resource Languages: Multiple Indian regional language NLP studies
- Language Models: Original papers of various pre-trained models
Overall Assessment: This is a high-quality low-resource language NLP research paper with important contributions in methodological innovation, experimental design, and practical value. While constrained by data scale, it opens new directions for Bodo language NLP research with significant academic and social value.