Part-of-Speech Tagger for Bodo Language using Deep Learning approach

Pathak, Narzary, Nandi et al.
Language Processing systems such as Part-of-speech tagging, Named entity recognition, Machine translation, Speech recognition, and Language modeling (LM) are well-studied in high-resource languages. Nevertheless, research on these systems for several low-resource languages, including Bodo, Mizo, Nagamese, and others, is either yet to commence or is in its nascent stages. Language models play a vital role in the downstream tasks of modern NLP. Extensive studies have been carried out on LMs for high-resource languages, but languages such as Bodo, Rabha, and Mising continue to lack coverage. In this study, we first present BodoBERT, a language model for the Bodo language. To the best of our knowledge, this work is the first such effort to develop a language model for Bodo. Secondly, we present an ensemble DL-based POS tagging model for Bodo. The POS tagging model is based on combinations of BiLSTM with CRF and stacked embedding of BodoBERT with BytePairEmbeddings. We cover several language models in the experiment to see how well they work in POS tagging tasks. The best-performing model achieves an F1 score of 0.8041. A comparative experiment was also conducted on Assamese POS taggers, considering that the language is spoken in the same region as Bodo.

Basic Information

  • Paper ID: 2401.03175
  • Title: Part-of-Speech Tagger for Bodo Language using Deep Learning approach
  • Authors: Dhrubajyoti Pathak, Sanjib Narzary, Sukumar Nandi, Bidisha Som
  • Institution: Centre for Linguistic Science and Technology, IIT Guwahati
  • Classification: cs.CL cs.AI cs.LG
  • Published Journal: Natural Language Engineering (Accepted)
  • Paper Link: https://arxiv.org/abs/2401.03175

Abstract

This research addresses natural language processing for Bodo (Boro), a low-resource language. While NLP tasks such as part-of-speech tagging, named entity recognition, and machine translation have been extensively studied for high-resource languages, research on low-resource languages like Bodo, Mizo, and Nagamese remains in its infancy. This paper first proposes BodoBERT, the first pre-trained language model specifically designed for the Bodo language. Second, based on a BiLSTM-CRF architecture and stacked embeddings combining BodoBERT with BytePair embeddings, an ensemble deep learning POS tagging model is developed. The optimal model achieves an F1 score of 0.8041 on the Bodo language POS tagging task.

Research Background and Motivation

Problem Definition

  1. Core Challenge: The Bodo language, spoken by about 1.5 million people in northeastern India and ranked as India's 20th largest language, lacks fundamental NLP tools and resources
  2. Technical Obstacles:
    • Absence of pre-trained language models covering Bodo
    • Scarcity of annotated data (only ~30k annotated sentences available)
    • Complex linguistic characteristics (Tibeto-Burman language family with rich morphology)

Significance Analysis

  • Linguistic Status: Bodo is one of 22 official languages of India and the official language of Bodoland Territorial Region
  • Application Demand: 1.5 million speakers urgently require corresponding NLP tool support
  • Academic Value: Fills a critical gap in low-resource language NLP research

Existing Limitations

  • Fundamental NLP tasks (morphological analysis, dependency parsing, language identification) remain unexplored
  • No available pre-trained language models
  • Lack of deep learning-based downstream NLP tools

Core Contributions

  1. First Bodo Language Model: Proposes BodoBERT based on BERT architecture, the first pre-trained language model specifically trained for Bodo
  2. Multi-Architecture POS Tagger Comparison: Systematically compares three sequence labeling architectures: CRF, Fine-tuning, and BiLSTM-CRF
  3. Multi-Lingual Model Performance Analysis: Evaluates performance of FastText, BPE, XLM-R, FlairEmbedding, IndicBERT, MuRIL, and other language models on Bodo POS tagging
  4. Stacked Embedding Method: Proposes both Individual and Stacked embedding approaches; the best stacked combination raises F1 from 0.7949 (BodoBERT alone) to 0.8041
  5. Open-Source Resources: Publicly releases the optimal POS tagger model and BodoBERT

Methodology Details

Task Definition

  • Input: Bodo language sentence sequences
  • Output: POS labels for each word (34 labels based on BIS tagset)
  • Constraints: Uses Devanagari script, adheres to Indian language standards (BIS tagset)

BodoBERT Language Model

Corpus Construction

  • Data Sources:
    • Linguistic Data Consortium for Indian Languages (LDC-IL)
    • Work by Narzary et al. (2022)
  • Corpus Scale: 1.6M tokens, 191k sentences
  • Domain Coverage: Aesthetics, business, mass media, technology, social sciences, and other domains

Model Architecture

  • Base Architecture: Multi-layer bidirectional Transformer (based on BERT framework)
  • Key Parameters:
    • 6 Transformer blocks
    • Hidden layer dimension: 768
    • Number of self-attention heads: 6
    • Total parameters: ~103M
    • Vocabulary size: 50,000 (WordPiece tokenizer)

Training Configuration

  • Hardware: Nvidia Tesla P100 GPU
  • Training Steps: 300K steps
  • Sequence Length: 128
  • Batch Size: 64
  • Optimizer: Adam (learning rate 2e-5, 3000-step warm-up)
  • Training Time: ~7 days
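
The optimizer settings above can be sketched as a schedule function. This is a minimal illustration, assuming linear warm-up to the peak rate followed by linear decay to zero (a common BERT recipe). The summary gives only the peak rate, the warm-up length, and the total step count, so the decay shape and the name `learning_rate_at` are assumptions.

```python
PEAK_LR = 2e-5         # from the summary
WARMUP_STEPS = 3_000   # from the summary
TOTAL_STEPS = 300_000  # from the summary
BATCH_SIZE = 64        # from the summary
SEQ_LENGTH = 128       # from the summary


def learning_rate_at(step: int) -> float:
    """Linear warm-up, then linear decay to zero (the decay shape is assumed)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(TOTAL_STEPS - step, 0)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


# Upper bound on token positions processed: steps x batch size x sequence length.
token_positions = TOTAL_STEPS * BATCH_SIZE * SEQ_LENGTH

if __name__ == "__main__":
    for step in (0, 1_500, 3_000, 150_000, 300_000):
        print(step, learning_rate_at(step))
    print(token_positions)
```

Under these settings the run covers at most 2,457,600,000 token positions (padding included), which is on the order of 1,500 passes over a 1.6M-token corpus.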

POS Tagging Model Architecture

Three Sequence Labeling Methods

  1. CRF Model: BodoBERT embeddings + CRF layer
  2. Fine-tuning Model: Direct fine-tuning of BodoBERT for POS tagging
  3. BiLSTM-CRF Model: BodoBERT embeddings + BiLSTM + CRF layer
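
Both the CRF and BiLSTM-CRF variants end in a CRF layer, which picks the jointly best tag sequence instead of tagging each token independently. The sketch below shows Viterbi decoding over per-token emission scores and a tag-to-tag transition table. The three-tag subset, the score layout, and the name `viterbi_decode` are illustrative assumptions, not details taken from the paper.

```python
from typing import Dict, List

TAGS = ["N_NN", "V_VM", "RD_PUNC"]  # small subset of the BIS tagset, for illustration


def viterbi_decode(
    emissions: List[Dict[str, float]],
    transitions: Dict[str, Dict[str, float]],
) -> List[str]:
    """Return the highest-scoring tag sequence.

    emissions[i][tag] is the score of `tag` at position i (e.g. from a BiLSTM);
    transitions[prev][cur] is the score of moving from `prev` to `cur`.
    """
    if not emissions:
        return []
    # best[i][tag] = best score of any path that ends in `tag` at position i
    best = [{tag: emissions[0][tag] for tag in TAGS}]
    backpointers: List[Dict[str, str]] = []
    for scores in emissions[1:]:
        current: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for tag in TAGS:
            prev = max(TAGS, key=lambda p: best[-1][p] + transitions[p][tag])
            current[tag] = best[-1][prev] + transitions[prev][tag] + scores[tag]
            pointers[tag] = prev
        best.append(current)
        backpointers.append(pointers)
    # Walk back from the best final tag.
    last = max(TAGS, key=lambda t: best[-1][t])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]
```

The transition scores are what let the CRF override a locally preferred tag: a strongly rewarded noun-to-verb transition, for example, can flip a token whose emission scores slightly favor a noun.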

Embedding Methods

  1. Individual Method: Using individual language models separately
  2. Stacked Method: Combining BodoBERT with other language models
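
Stacking embeddings amounts to concatenating the per-token vectors from each embedding model before they reach the tagger. A minimal sketch of that idea with toy vectors follows. The two lookup functions and their dimensions are hypothetical stand-ins, not the real BodoBERT or BytePair models, and the tokens are placeholders.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Embedder = Callable[[str], Vector]


def fake_contextual_embedder(token: str) -> Vector:
    """Hypothetical stand-in for a BodoBERT lookup (toy 3-dim output)."""
    return [float(len(token)), 1.0, 0.0]


def fake_subword_embedder(token: str) -> Vector:
    """Hypothetical stand-in for a BytePair embedding lookup (toy 2-dim output)."""
    return [0.5, float(token.count("a"))]


def stack_embeddings(tokens: Sequence[str], embedders: Sequence[Embedder]) -> List[Vector]:
    """Concatenate the vectors from every embedder, token by token."""
    stacked: List[Vector] = []
    for token in tokens:
        vector: Vector = []
        for embed in embedders:
            vector.extend(embed(token))
        stacked.append(vector)
    return stacked


vectors = stack_embeddings(
    ["tok_a", "tok_b"],
    [fake_contextual_embedder, fake_subword_embedder],
)
```

The stacked dimension is the sum of the component dimensions (3 + 2 = 5 here), so the tagger's input layer grows with each model added to the stack.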

Technical Innovations

  1. Language Adaptation: First specialized language model designed for Bodo linguistic characteristics
  2. Multi-Model Fusion: Systematic comparison and fusion of multiple pre-trained models
  3. Cross-Lingual Transfer: Leveraging Hindi models sharing the same writing system (Devanagari)
  4. Stacking Strategy: Innovatively combining specialized language models with general-purpose models

Experimental Setup

Dataset

  • Annotated Corpus: Bodo Monolingual Text Corpus (ILCI-II)
  • Data Scale:
    • Training set: 24,003 sentences, 192k tokens
    • Validation set: 2,325 sentences, 23k tokens
    • Test set: 3,161 sentences, 23k tokens
  • Label System: BIS tagset with 11 top-level categories and 34 specific labels
  • Data Format: CoNLL-2003 format
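
A CoNLL-style column file holds one token per line together with its tag, with a blank line between sentences. The reader below is a minimal sketch that assumes whitespace-separated columns with the token first and the POS tag last. The exact column layout of the ILCI-II files is not given here, and the tokens in `SAMPLE` are placeholders, not real Bodo text.

```python
from typing import List, Tuple

Sentence = List[Tuple[str, str]]


def read_conll(text: str) -> List[Sentence]:
    """Parse token/tag columns into sentences; blank lines separate sentences."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        current.append((columns[0], columns[-1]))  # assumed: token first, tag last
    if current:
        sentences.append(current)
    return sentences


SAMPLE = """\
TOKEN1 N_NN
TOKEN2 V_VM
. RD_PUNC

TOKEN3 N_NNP
TOKEN4 V_VM
"""

sentences = read_conll(SAMPLE)
```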

Evaluation Metrics

  • Primary Metric: F1-score (Micro)
  • Secondary Metrics: F1-score (Weighted), Precision, Recall
  • Label-Level Analysis: Detailed performance for each POS label
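
The difference between the micro and weighted F1 variants can be made concrete with a small token-level computation. This is a sketch of the standard definitions and may differ in detail from what the evaluation framework reports. The function name `f1_scores` is an assumption.

```python
from collections import Counter
from typing import Dict, List, Tuple


def f1_scores(gold: List[str], pred: List[str]) -> Tuple[float, float, Dict[str, float]]:
    """Return (micro F1, support-weighted F1, per-label F1) for token-level tags."""
    tp: Counter = Counter()
    fp: Counter = Counter()
    fn: Counter = Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label was wrong
            fn[g] += 1  # gold label was missed

    def f1(t: int, f_pos: int, f_neg: int) -> float:
        denom = 2 * t + f_pos + f_neg
        return 2 * t / denom if denom else 0.0

    labels = sorted(set(gold) | set(pred))
    per_label = {lab: f1(tp[lab], fp[lab], fn[lab]) for lab in labels}
    # Micro: pool the counts over all labels before computing F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Weighted: average per-label F1, weighted by how often each label occurs in gold.
    support = Counter(gold)
    weighted = sum(per_label[lab] * support[lab] for lab in labels) / len(gold)
    return micro, weighted, per_label
```

For gold tags `["N_NN", "N_NN", "N_NN", "V_VM"]` and predictions `["N_NN", "N_NN", "V_VM", "V_VM"]`, micro F1 is 0.75 (three of four tokens correct, so it equals accuracy when every token gets exactly one tag), while weighted F1 is about 0.767.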

Baseline Methods

Language Model Comparison

Model           | Training Corpus  | Data Volume
FastText        | Wiki             | <29M
BytePair        | Wiki             | 29M
BodoBERT        | Bodo corpus      | 1.6M
FlairEmbeddings | Wiki+OPUS        | ≈29M
MuRIL           | CommonCrawl+Wiki | 788M
XLM-R           | CC-100           | 1.7B
IndicBERT       | Scraping         | 1.84B

Architecture Comparison

  • CRF vs Fine-tuning vs BiLSTM-CRF
  • Individual vs Stacked embedding methods

Implementation Details

  • Framework: Flair framework
  • Batch Size: 32
  • Early Stopping Strategy: Stop when validation performance plateaus
  • Learning Rate Schedule: Learning Rate Annealing
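
Early stopping and learning-rate annealing are often combined: the rate is cut whenever the validation score stops improving, and training ends once the rate falls below a floor. The simulation below sketches that interaction. Every numeric default (`initial_lr`, `anneal_factor`, `patience`, `min_lr`) and the name `anneal_on_plateau` are assumptions, since the summary does not state the values used.

```python
from typing import List, Tuple


def anneal_on_plateau(
    dev_scores: List[float],
    initial_lr: float = 0.1,     # assumed value
    anneal_factor: float = 0.5,  # assumed value
    patience: int = 2,           # assumed value
    min_lr: float = 0.02,        # assumed value
) -> Tuple[int, List[float]]:
    """Simulate training; return (epochs run, learning rate used in each epoch)."""
    lr = initial_lr
    best = float("-inf")
    bad_epochs = 0
    history: List[float] = []
    for score in dev_scores:
        history.append(lr)
        if score > best:
            best, bad_epochs = score, 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:
            lr *= anneal_factor  # plateau detected: cut the learning rate
            bad_epochs = 0
        if lr < min_lr:
            break  # early stop: the rate was annealed below the floor
    return len(history), history
```

With the defaults above, a validation curve that improves twice and then stays flat is cut from 0.1 to 0.05 to 0.025 and stops once the next cut would take the rate below 0.02.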

Experimental Results

Main Results

Architecture Comparison

Embedding Method | Tagging Model   | F1-score (Micro) | F1-score (Weighted)
BodoBERT         | CRF             | 0.7583           | 0.7454
BodoBERT         | Fine-tuned BERT | 0.7754           | 0.7775
BodoBERT         | BiLSTM + CRF    | 0.7949           | 0.7898

Individual Method Language Model Comparison

Embedding Model | Bodo F1 | Assamese F1
FastText        | 0.7686  | 0.6981
BytePair        | 0.7669  | 0.7099
BodoBERT        | 0.7949  | 0.7033
FlairEmbeddings | 0.7885  | 0.7076
MuRIL           | 0.7708  | 0.7286
XLM-R           | 0.7638  | 0.7001
IndicBERT       | 0.7235  | 0.7293

Stacked Method Results

Stacked Embedding Combination | F1 Score
BodoBERT + FastText           | 0.7928
BodoBERT + BytePair           | 0.8041
BodoBERT + mBERT              | 0.799
BodoBERT + FlairEmbeddings    | 0.801
BodoBERT + MuRIL              | 0.785
BodoBERT + XLM-R              | 0.8003
BodoBERT + IndicBERT          | 0.793
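
To see how much each partner embedding contributes, the stacked scores can be compared against BodoBERT alone (0.7949 with BiLSTM-CRF). The short calculation below uses only figures reported in this summary and assumes that the stacked and individual results are the same micro F1 measured on the same test set.

```python
BODOBERT_ALONE = 0.7949  # individual-method Bodo F1 from the summary

# Stacked results from the summary, keyed by the embedding added to BodoBERT.
STACKED_F1 = {
    "FastText": 0.7928,
    "BytePair": 0.8041,
    "mBERT": 0.799,
    "FlairEmbeddings": 0.801,
    "MuRIL": 0.785,
    "XLM-R": 0.8003,
    "IndicBERT": 0.793,
}

# Change in F1 from adding each partner embedding to BodoBERT.
deltas = {name: round(f1 - BODOBERT_ALONE, 4) for name, f1 in STACKED_F1.items()}
helped = sorted(name for name, delta in deltas.items() if delta > 0)
hurt = sorted(name for name, delta in deltas.items() if delta < 0)

if __name__ == "__main__":
    for name, delta in sorted(deltas.items(), key=lambda item: -item[1]):
        print(f"{name:16s} {delta:+.4f}")
```

Under that assumption, four of the seven combinations (BytePair, FlairEmbeddings, XLM-R, mBERT) improve on BodoBERT alone, with BytePair giving the largest gain (+0.0092), while FastText, IndicBERT, and MuRIL score slightly lower.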

Data Augmentation Experiments

By adding 10k automatically annotated and manually corrected sentences:

  • Performance Improvement: F1 increased from 0.8041 to 0.8494, an absolute gain of about 0.045 (roughly 4.5 points)
  • Validates Model Scalability

Label-Level Analysis

Best model performance on major POS labels:

  • V_VM (Verb): F1=0.9150 (highest among the non-punctuation labels listed here)
  • RD_PUNC (Punctuation): F1=0.9944 (near perfect)
  • N_NN (Noun): F1=0.7628 (largest class)
  • N_NNP (Proper Noun): F1=0.6946 (more difficult)

Error Analysis

Main error patterns discovered through confusion matrices:

  1. Intra-Class Confusion: Common nouns (N_NN) vs proper nouns (N_NNP), locative nouns (N_NST)
  2. Part-of-Speech Conversion: Difficulty in tagging nouns used as adjectives
  3. Writing System Limitations: Bodo lacks capitalization conventions for proper nouns like English
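
Error patterns like these are typically read off a confusion matrix. A minimal sketch of the counting step follows. The function name `top_confusions` is an assumption, and it expects gold and predicted tag lists that are already aligned token by token.

```python
from collections import Counter
from typing import List, Tuple

Confusion = Tuple[Tuple[str, str], int]


def top_confusions(gold: List[str], pred: List[str], k: int = 3) -> List[Confusion]:
    """Count (gold, predicted) pairs where the tagger was wrong, most frequent first."""
    errors = Counter((g, p) for g, p in zip(gold, pred) if g != p)
    return errors.most_common(k)
```

Each returned entry pairs a (gold tag, predicted tag) combination with its count, so a tagger that often labels proper nouns as common nouns would show `("N_NNP", "N_NN")` near the top of the list.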

Cross-Lingual Comparison

Bodo vs Assamese POS tagging results:

  • Bodo Best: 0.8041 (BodoBERT+BytePair)
  • Assamese Best: 0.7293 (IndicBERT)
  • Difference Rationale: Different label set complexity (Bodo 34 labels vs Assamese 41 labels)

Related Work

Low-Resource Language POS Tagging

  • Assamese: Pathak et al. (2022, 2023) - BiLSTM-CRF achieves 86.52% F1
  • Khasi: Warjri et al. (2021) - 96.98% accuracy
  • Bengali: Alam et al. (2016) - 86.0% accuracy, Kabir et al. (2016) - 93.33% accuracy
  • Mizo: Pandey et al. (2022) - LSTM achieves 81.86% accuracy

Advantages of This Work

  1. Originality: First neural network-based POS tagger for Bodo language
  2. Systematicity: Comprehensive comparison of multiple architectures and language models
  3. Practicality: Provides open-source models and tools

Conclusions and Discussion

Main Conclusions

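  1. BodoBERT Effectiveness: Among the individual embeddings, the Bodo-specific BodoBERT gives the highest Bodo F1 (0.7949), ahead of much larger multilingual models such as XLM-R (0.7638) and IndicBERT (0.7235)
  2. Architecture Advantages: With BodoBERT embeddings, BiLSTM-CRF (0.7949 micro F1) outperforms both fine-tuning (0.7754) and CRF alone (0.7583)
  3. Stacking Strategy Effectiveness: The best stacked combination, BodoBERT + BytePair (0.8041), beats the best single embedding (0.7949), although some combinations such as BodoBERT + MuRIL (0.785) score lower than BodoBERT alone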
  4. Baseline Establishment: Establishes important baseline for Bodo language NLP research

Limitations

  1. Data Scale: Annotated corpus is relatively small (30k sentences)
  2. Language Model Training Data: BodoBERT trained on only 1.6M tokens
  3. Performance Level: Still lags behind high-resource languages (F1 of 0.8041 versus the 0.90 or higher commonly reported for them)
  4. Annotation Quality: Some annotations may require further correction

Future Directions

  1. Expand Corpus: Collect more Bodo language text and annotated data
  2. Model Improvement: Optimize BodoBERT architecture and training strategies
  3. Downstream Tasks: Extend to NER, syntactic parsing, and other NLP tasks
  4. Multilingual Modeling: Explore joint modeling with related languages

In-Depth Evaluation

Strengths

  1. Pioneering Contribution: First construction of language model and POS tagger for Bodo, filling an important gap
  2. Systematic Research: Comprehensive comparison of multiple methods with well-designed experiments
  3. Technical Innovation: Stacked embedding strategy effectively improves performance
  4. Practical Value: Open-source release of models provides foundational tools for the community
  5. Cross-Lingual Insights: Valuable cross-lingual analysis through Assamese comparison

Weaknesses

  1. Data Limitations: Relatively small training data scale may affect model generalization
  2. Evaluation Limitations: Lacks comparison with traditional methods (HMM, rule-based approaches)
  3. Error Analysis Depth: Insufficient linguistic analysis of model failure cases
  4. Computational Resources: High model training costs may limit reproducibility

Impact

  1. Academic Value: Provides important paradigm for low-resource language NLP research
  2. Practical Significance: Directly serves the practical needs of the Bodo language community
  3. Methodological Contribution: Stacked embedding strategy generalizable to other low-resource languages
  4. Infrastructure: Establishes foundation for subsequent Bodo language NLP research

Applicable Scenarios

  1. Direct Application: Bodo language text processing, information extraction
  2. Research Foundation: Preprocessing step for other Bodo language NLP tasks
  3. Method Transfer: POS tagging tasks for similar low-resource languages
  4. Multilingual Systems: Component of multilingual NLP systems for northeastern India

References

This paper cites abundant related work, primarily including:

  • BERT-related: Devlin et al. (2018) - Original BERT paper
  • Sequence Labeling: Huang et al. (2015) - BiLSTM-CRF architecture
  • Low-Resource Languages: Multiple Indian regional language NLP studies
  • Language Models: Original papers of various pre-trained models

Overall Assessment: This is a high-quality low-resource language NLP research paper with important contributions in methodological innovation, experimental design, and practical value. While constrained by data scale, it opens new directions for Bodo language NLP research with significant academic and social value.