Part-of-Speech Tagger for Bodo Language using Deep Learning approach

Pathak, Narzary, Nandi et al.

Language processing systems such as part-of-speech tagging, named entity recognition, machine translation, speech recognition, and language modeling (LM) are well studied in high-resource languages. Nevertheless, research on these systems for several low-resource languages, including Bodo, Mizo, Nagamese, and others, is either yet to commence or is in its nascent stages. Language models play a vital role in the downstream tasks of modern NLP. Extensive studies have been carried out on LMs for high-resource languages. Nevertheless, languages such as Bodo, Rabha, and Mising continue to lack coverage. In this study, we first present BodoBERT, a language model for the Bodo language. To the best of our knowledge, this work is the first such effort to develop a language model for Bodo. Secondly, we present an ensemble DL-based POS tagging model for Bodo. The POS tagging model is based on combinations of BiLSTM with CRF and stacked embedding of BodoBERT with BytePairEmbeddings. We cover several language models in the experiment to see how well they work in POS tagging tasks. The best-performing model achieves an F1 score of 0.8041. A comparative experiment was also conducted on Assamese POS taggers, considering that the language is spoken in the same region as Bodo.

Basic Information

Paper ID: 2401.03175
Title: Part-of-Speech Tagger for Bodo Language using Deep Learning approach
Authors: Dhrubajyoti Pathak, Sanjib Narzary, Sukumar Nandi, Bidisha Som
Institution: Centre for Linguistic Science and Technology, IIT Guwahati
Classification: cs.CL cs.AI cs.LG
Published Journal: Natural Language Engineering (Accepted)
Paper Link: https://arxiv.org/abs/2401.03175

Abstract

This research addresses natural language processing for Bodo (Boro), a low-resource language. While NLP tasks such as part-of-speech tagging, named entity recognition, and machine translation have been extensively studied for high-resource languages, research on low-resource languages like Bodo, Mizo, and Nagamese remains in its infancy. This paper first proposes BodoBERT, the first pre-trained language model specifically designed for the Bodo language. Second, based on a BiLSTM-CRF architecture and stacked embeddings combining BodoBERT with BytePair embeddings, an ensemble deep learning POS tagging model is developed. The optimal model achieves an F1 score of 0.8041 on the Bodo language POS tagging task.

Research Background and Motivation

Problem Definition

Core Challenge: Bodo language, spoken by 1.5 million people in northeastern India and ranked as India's 20th largest language, lacks fundamental NLP tools and resources
Technical Obstacles:
- Absence of pre-trained language models covering Bodo
- Scarcity of annotated data (only ~30k annotated sentences available)
- Complex linguistic characteristics (Tibeto-Burman language family with rich morphology)

Significance Analysis

Linguistic Status: Bodo is one of 22 official languages of India and the official language of Bodoland Territorial Region
Application Demand: Bodo's 1.5 million speakers currently have little NLP tool support for their language
Academic Value: Fills a critical gap in low-resource language NLP research

Existing Limitations

Fundamental NLP tasks (morphological analysis, dependency parsing, language identification) remain unexplored
No available pre-trained language models
Lack of deep learning-based downstream NLP tools

Core Contributions

First Bodo Language Model: Proposes BodoBERT based on BERT architecture, the first pre-trained language model specifically trained for Bodo
Multi-Architecture POS Tagger Comparison: Systematically compares three sequence labeling architectures: CRF, Fine-tuning, and BiLSTM-CRF
Multi-Lingual Model Performance Analysis: Evaluates performance of FastText, BPE, XLM-R, FlairEmbedding, IndicBERT, MuRIL, and other language models on Bodo POS tagging
Stacked Embedding Method: Proposes both Individual and Stacked embedding approaches; the best Stacked combination raises F1 from 0.7949 (BodoBERT alone) to 0.8041 (BodoBERT + BytePair)
Open-Source Resources: Publicly releases the optimal POS tagger model and BodoBERT

Methodology Details

Task Definition

Input: Bodo language sentence sequences
Output: POS labels for each word (34 labels based on BIS tagset)
Constraints: Uses Devanagari script, adheres to Indian language standards (BIS tagset)

BodoBERT Language Model

Corpus Construction

Data Sources:
- Linguistic Data Consortium for Indian Languages (LDC-IL)
- Work by Narzary et al. (2022)
Corpus Scale: 1.6M tokens, 191k sentences
Domain Coverage: Aesthetics, business, mass media, technology, social sciences, and other domains

Model Architecture

Base Architecture: Multi-layer bidirectional Transformer (based on BERT framework)
Key Parameters:
- 6 Transformer blocks
- Hidden layer dimension: 768
- Number of self-attention heads: 6
- Total parameters: ~103M
- Vocabulary size: 50,000 (WordPiece tokenizer)
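
The WordPiece tokenizer listed above splits a word into subword units by greedy longest-match-first lookup against the vocabulary. The following is a minimal sketch of that lookup, not BodoBERT's actual tokenizer: the function name and the tiny Latin-script vocabulary are hypothetical, chosen only so the example is easy to read (the real vocabulary has 50,000 entries for Devanagari-script text).

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece split of a single word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, for illustration only.
TOY_VOCAB = {"un", "##aff", "##able", "tag", "##ger", "##s"}
```

With this toy vocabulary, "taggers" splits into "tag", "##ger", "##s", while a word with no matching pieces collapses to the unknown token.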

Training Configuration

Hardware: Nvidia Tesla P100 GPU
Training Steps: 300K steps
Sequence Length: 128
Batch Size: 64
Optimizer: Adam (learning rate 2e-5, 3000-step warm-up)
Training Time: ~7 days
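
The warm-up setting above can be sketched as a learning-rate function of the training step. The peak rate, warm-up length, step count, and batch size come from the configuration listed here; the linear shape of the warm-up and the linear decay to zero afterwards are assumptions (a common choice for BERT-style pre-training), since the summary does not state the schedule shape.

```python
PEAK_LR = 2e-5         # from the training configuration
WARMUP_STEPS = 3_000   # from the training configuration
TOTAL_STEPS = 300_000  # from the training configuration
BATCH_SIZE = 64        # from the training configuration

def learning_rate(step):
    """Linear warm-up to PEAK_LR, then linear decay to zero (shape is assumed)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = TOTAL_STEPS - step
    return PEAK_LR * max(remaining, 0) / (TOTAL_STEPS - WARMUP_STEPS)

# Number of training sequences processed over the whole run.
sequences_seen = TOTAL_STEPS * BATCH_SIZE
```

Under these settings the run processes 300,000 x 64 = 19,200,000 sequences of up to 128 tokens each.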

POS Tagging Model Architecture

Three Sequence Labeling Methods

CRF Model: BodoBERT embeddings + CRF layer
Fine-tuning Model: Direct fine-tuning of BodoBERT for POS tagging
BiLSTM-CRF Model: BodoBERT embeddings + BiLSTM + CRF layer
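
What the CRF layer adds over per-token classification is a score for each tag-to-tag transition, so the best tag sequence is found jointly with Viterbi decoding. The sketch below is a generic, minimal Viterbi decoder, not the paper's implementation; the function name, the score values, and the two-tag example are hypothetical. In a BiLSTM-CRF the emission scores would come from the BiLSTM output, and in the plain CRF model from a projection of the embeddings.

```python
def viterbi_decode(emissions, transitions, tags):
    """Return the highest-scoring tag sequence.

    emissions:   list of dicts, one per token, mapping tag -> score
    transitions: dict mapping (previous_tag, tag) -> score
    """
    # best[t][tag] = (score of the best path ending in tag at position t, backpointer)
    best = [{tag: (emissions[0][tag], None) for tag in tags}]
    for t in range(1, len(emissions)):
        column = {}
        for tag in tags:
            prev = max(tags, key=lambda p: best[t - 1][p][0] + transitions[(p, tag)])
            score = best[t - 1][prev][0] + transitions[(prev, tag)] + emissions[t][tag]
            column[tag] = (score, prev)
        best.append(column)
    # Follow backpointers from the best final tag.
    tag = max(tags, key=lambda k: best[-1][k][0])
    path = [tag]
    for t in range(len(emissions) - 1, 0, -1):
        tag = best[t][tag][1]
        path.append(tag)
    return path[::-1]

# Hypothetical scores for a two-token sentence.
TAGS = ["N_NN", "V_VM"]
EMISSIONS = [{"N_NN": 2.0, "V_VM": 0.5}, {"N_NN": 1.2, "V_VM": 1.0}]
TRANSITIONS = {("N_NN", "N_NN"): 0.0, ("N_NN", "V_VM"): 1.0,
               ("V_VM", "N_NN"): 0.0, ("V_VM", "V_VM"): 0.0}
decoded = viterbi_decode(EMISSIONS, TRANSITIONS, TAGS)
```

In this toy case, picking the best tag per token independently would give N_NN for both tokens, whereas the transition score for noun followed by verb makes the joint decoder choose N_NN then V_VM.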

Embedding Methods

Individual Method: Using individual language models separately
Stacked Method: Combining BodoBERT with other language models
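
A stacked embedding can be pictured as a token-by-token concatenation of the vectors produced by each model, which is how the Flair framework's StackedEmbeddings class combines embeddings. The sketch below illustrates only that concatenation, assuming the paper's Stacked method works the same way; the function name and the toy vectors are hypothetical, and real embedding dimensions are far larger.

```python
def stack_embeddings(*per_model_vectors):
    """Concatenate, token by token, the vectors produced by several models."""
    lengths = {len(vectors) for vectors in per_model_vectors}
    if len(lengths) != 1:
        raise ValueError("every model must embed the same number of tokens")
    return [
        [value for vector in token_vectors for value in vector]
        for token_vectors in zip(*per_model_vectors)
    ]

# Hypothetical toy vectors for a two-token sentence.
bodobert_vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]  # 3 dimensions per token
bytepair_vectors = [[1.0, 2.0], [3.0, 4.0]]            # 2 dimensions per token
stacked = stack_embeddings(bodobert_vectors, bytepair_vectors)
```

Each token ends up with a 5-dimensional vector here (3 + 2), so the downstream BiLSTM sees both representations side by side.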

Technical Innovations

Language Adaptation: First specialized language model designed for Bodo linguistic characteristics
Multi-Model Fusion: Systematic comparison and fusion of multiple pre-trained models
Cross-Lingual Transfer: Leveraging Hindi models sharing the same writing system (Devanagari)
Stacking Strategy: Combining the Bodo-specific language model with general-purpose multilingual models

Experimental Setup

Dataset

Annotated Corpus: Bodo Monolingual Text Corpus (ILCI-II)
Data Scale:
- Training set: 24,003 sentences, 192k tokens
- Validation set: 2,325 sentences, 23k tokens
- Test set: 3,161 sentences, 23k tokens
Label System: BIS tagset with 11 top-level categories and 34 specific labels
Data Format: CoNLL-2003 format
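
A CoNLL-style file stores one token per line with its annotation in whitespace-separated columns and uses a blank line between sentences. The reader below is a minimal sketch under the assumption that the token is the first column and the POS tag the last; the exact column layout of the ILCI-II files is not given here, and the placeholder tokens w1 to w5 stand in for Devanagari-script Bodo words. The tags themselves are BIS labels mentioned elsewhere in this summary.

```python
def read_conll(text):
    """Parse token/tag lines into sentences; blank lines separate sentences."""
    sentences, current = [], []
    for line in text.splitlines():
        columns = line.split()
        if not columns:
            if current:
                sentences.append(current)
                current = []
            continue
        # Assumed layout: token in the first column, POS tag in the last.
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences

def top_level(tag):
    """Map a BIS label such as 'N_NN' to its top-level category 'N'."""
    return tag.split("_")[0]

SAMPLE = """w1 N_NNP
w2 N_NN
w3 V_VM
w4 RD_PUNC

w5 N_NN
"""
sentences = read_conll(SAMPLE)
```

The `top_level` helper reflects the two-level structure of the tagset: the part before the underscore is one of the 11 top-level categories, and the full string is one of the 34 specific labels.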

Evaluation Metrics

Primary Metric: F1-score (Micro)
Secondary Metrics: F1-score (Weighted), Precision, Recall
Label-Level Analysis: Detailed performance for each POS label
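
The difference between the two F1 variants can be made concrete with a small computation. This is a sketch of the standard definitions, not the evaluation code used in the paper, and the gold and predicted tags are hypothetical. Micro F1 pools true positives, false positives, and false negatives over all labels; weighted F1 averages per-label F1 weighted by each label's frequency in the gold data.

```python
from collections import Counter

def f1_scores(gold, predicted):
    """Return (micro F1, weighted F1) for two equal-length tag lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted but is wrong
            fn[g] += 1  # g was missed

    def f1(true_pos, false_pos, false_neg):
        denom = 2 * true_pos + false_pos + false_neg
        return 2 * true_pos / denom if denom else 0.0

    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    support = Counter(gold)
    weighted = sum(
        support[tag] * f1(tp[tag], fp[tag], fn[tag]) for tag in support
    ) / len(gold)
    return micro, weighted

# Hypothetical gold and predicted tags, for illustration only.
gold = ["N_NN", "N_NN", "N_NN", "N_NN", "V_VM", "N_NNP"]
pred = ["N_NN", "N_NN", "N_NN", "N_NNP", "V_VM", "N_NNP"]
micro, weighted = f1_scores(gold, pred)
```

When every token receives exactly one gold tag and one predicted tag, as in POS tagging, micro F1 coincides with token-level accuracy (5 of 6 correct in this toy example), while weighted F1 can differ because it is sensitive to how errors are distributed across labels.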

Baseline Methods

Language Model Comparison

Model	Training Corpus	Data Volume
FastText	Wiki	<29M
BytePair	Wiki	29M
BodoBERT	Bodo corpus	1.6M
FlairEmbeddings	Wiki+OPUS	≈29M
MuRIL	CommonCrawl+Wiki	788M
XLM-R	CC-100	1.7B
IndicBERT	Scraping	1.84B

Architecture Comparison

CRF vs Fine-tuning vs BiLSTM-CRF
Individual vs Stacked embedding methods

Implementation Details

Framework: Flair framework
Batch Size: 32
Early Stopping Strategy: Stop when validation performance plateaus
Learning Rate Schedule: Learning Rate Annealing
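
Learning rate annealing and early stopping are often tied together: the rate is cut when the validation score stops improving, and training ends once the rate drops below a floor. The simulation below sketches that interaction. Every numeric default (initial rate, factor, patience, floor) is an illustrative assumption, not a setting reported for this work, and the function name is hypothetical.

```python
def anneal_on_plateau(dev_scores, lr=0.1, factor=0.5, patience=3, min_lr=0.01):
    """Simulate annealing on a plateau; return (epochs run, final learning rate).

    All default values are illustrative assumptions, not the paper's settings.
    """
    best, bad_epochs = float("-inf"), 0
    for epoch, score in enumerate(dev_scores, start=1):
        if score > best:
            best, bad_epochs = score, 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:
            lr *= factor          # validation score has plateaued: anneal
            bad_epochs = 0
            if lr < min_lr:
                return epoch, lr  # learning rate below the floor: stop early
    return len(dev_scores), lr

# Hypothetical validation curve: three improving epochs, then a plateau.
epochs_run, final_lr = anneal_on_plateau([0.70, 0.75, 0.78] + [0.78] * 20)
```

With this hypothetical curve, the rate is halved after every four epochs without improvement, and training stops at epoch 19 instead of running all 23 epochs.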

Experimental Results

Main Results

Architecture Comparison

Embedding Method	Tagging Model	F1-score (Micro)	F1-score (Weighted)
BodoBERT	CRF	0.7583	0.7454
BodoBERT	Fine-tuned BERT	0.7754	0.7775
BodoBERT	BiLSTM + CRF	0.7949	0.7898

Individual Method Language Model Comparison

Embedding Model	Bodo F1	Assamese F1
FastText	0.7686	0.6981
BytePair	0.7669	0.7099
BodoBERT	0.7949	0.7033
FlairEmbeddings	0.7885	0.7076
MuRIL	0.7708	0.7286
XLM-R	0.7638	0.7001
IndicBERT	0.7235	0.7293

Stacked Method Results

Stacked Embedding Combination	F1 Score
BodoBERT + FastText	0.7928
BodoBERT + BytePair	0.8041
BodoBERT + mBERT	0.799
BodoBERT + FlairEmbeddings	0.801
BodoBERT + MuRIL	0.785
BodoBERT + XLM-R	0.8003
BodoBERT + IndicBERT	0.793
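
To see how much each stacking partner adds, the stacked scores can be compared with BodoBERT used alone (0.7949 in the Individual method table). The snippet below only does that arithmetic on the figures reported above; it assumes both tables report the same metric under the same BiLSTM-CRF setup.

```python
BODOBERT_ALONE = 0.7949  # BodoBERT, Individual method
STACKED_F1 = {           # BodoBERT stacked with each partner embedding
    "FastText": 0.7928,
    "BytePair": 0.8041,
    "mBERT": 0.799,
    "FlairEmbeddings": 0.801,
    "MuRIL": 0.785,
    "XLM-R": 0.8003,
    "IndicBERT": 0.793,
}

# Absolute F1 change from adding each partner to BodoBERT.
gains = {name: round(f1 - BODOBERT_ALONE, 4) for name, f1 in STACKED_F1.items()}
best_partner = max(gains, key=gains.get)
helped = sorted(name for name, gain in gains.items() if gain > 0)
```

On these figures, four of the seven partners (BytePair, FlairEmbeddings, XLM-R, mBERT) improve on BodoBERT alone, with BytePair giving the largest gain (+0.0092), while FastText, MuRIL, and IndicBERT fall slightly below it.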

Data Augmentation Experiments

By adding 10k automatically annotated and manually corrected sentences:

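Performance Improvement: F1 increased from 0.8041 to 0.8494 (+0.0453 absolute, roughly 4.5 percentage points)
Scalability: The gain suggests the tagger continues to improve as more annotated data is added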

Label-Level Analysis

Best model performance on major POS labels:

V_VM (Verb): F1=0.9150 (highest among the non-punctuation labels listed here)
RD_PUNC (Punctuation): F1=0.9944 (near perfect, highest overall)
N_NN (Noun): F1=0.7628 (largest class)
N_NNP (Proper Noun): F1=0.6946 (more difficult)

Error Analysis

Main error patterns discovered through confusion matrices:

Intra-Class Confusion: Common nouns (N_NN) are confused with proper nouns (N_NNP) and locative nouns (N_NST)
Part-of-Speech Conversion: Difficulty in tagging nouns used as adjectives
Writing System Limitations: Unlike English, the Devanagari script used for Bodo has no capitalization to mark proper nouns
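
Error patterns of this kind are typically read off a confusion matrix, which counts how often each gold label is predicted as each other label. The sketch below counts only the off-diagonal (mis-tagged) pairs; the function name and the gold and predicted tags are hypothetical and do not reproduce the paper's data.

```python
from collections import Counter

def confusion_pairs(gold, predicted):
    """Count (gold, predicted) pairs for mis-tagged tokens only."""
    return Counter((g, p) for g, p in zip(gold, predicted) if g != p)

# Hypothetical gold and predicted tags, for illustration only.
gold = ["N_NNP", "N_NN", "N_NST", "N_NNP", "V_VM", "N_NN"]
pred = ["N_NN", "N_NN", "N_NN", "N_NN", "V_VM", "N_NNP"]
errors = confusion_pairs(gold, pred)
top_error = errors.most_common(1)[0]
```

Sorting these pairs by count surfaces the dominant confusions first; in the toy data the most frequent error is a proper noun tagged as a common noun.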

Cross-Lingual Comparison

Bodo vs Assamese POS tagging results:

Bodo Best: 0.8041 (BodoBERT+BytePair)
Assamese Best: 0.7293 (IndicBERT)
Difference Rationale: Different label set complexity (Bodo 34 labels vs Assamese 41 labels)

Low-Resource Language POS Tagging

Assamese: Pathak et al. (2022, 2023) - BiLSTM-CRF achieves 86.52% F1
Khasi: Warjri et al. (2021) - 96.98% accuracy
Bengali: Alam et al. (2016) - 86.0% accuracy, Kabir et al. (2016) - 93.33% accuracy
Mizo: Pandey et al. (2022) - LSTM achieves 81.86% accuracy

Advantages of This Work

Originality: First neural network-based POS tagger for Bodo language
Systematicity: Comprehensive comparison of multiple architectures and language models
Practicality: Provides open-source models and tools

Conclusions and Discussion

Main Conclusions

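BodoBERT Effectiveness: Among individual embeddings, the Bodo-specific BodoBERT gives the best Bodo POS tagging result (F1 0.7949), ahead of much larger multilingual models
Architecture Advantages: With BodoBERT embeddings, BiLSTM-CRF (0.7949) outperforms both Fine-tuning (0.7754) and CRF (0.7583)
Stacking Strategy Effectiveness: The best stacked combination (BodoBERT + BytePair, 0.8041) outperforms the best single embedding (0.7949), although not every combination does (e.g., BodoBERT + MuRIL at 0.785)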
Baseline Establishment: Establishes important baseline for Bodo language NLP research

Limitations

Data Scale: Annotated corpus is relatively small (~30k sentences)
Language Model Training Data: BodoBERT trained on only 1.6M tokens
Performance Level: Still lags behind high-resource languages (F1 of 0.8041 versus 0.90 or higher)
Annotation Quality: Some annotations may require further correction

Future Directions

Expand Corpus: Collect more Bodo language text and annotated data
Model Improvement: Optimize BodoBERT architecture and training strategies
Downstream Tasks: Extend to NER, syntactic parsing, and other NLP tasks
Multilingual Modeling: Explore joint modeling with related languages

In-Depth Evaluation

Strengths

Pioneering Contribution: First construction of language model and POS tagger for Bodo, filling an important gap
Systematic Research: Comprehensive comparison of multiple methods with well-designed experiments
Technical Innovation: Stacked embedding strategy effectively improves performance
Practical Value: Open-source release of models provides foundational tools for the community
Cross-Lingual Insights: Valuable cross-lingual analysis through Assamese comparison

Weaknesses

Data Limitations: Relatively small training data scale may affect model generalization
Evaluation Limitations: Lacks comparison with traditional methods (HMM, rule-based approaches)
Error Analysis Depth: Insufficient linguistic analysis of model failure cases
Computational Resources: High model training costs may limit reproducibility

Impact

Academic Value: Provides important paradigm for low-resource language NLP research
Practical Significance: Directly serves the practical needs of the Bodo language community
Methodological Contribution: Stacked embedding strategy generalizable to other low-resource languages
Infrastructure: Establishes foundation for subsequent Bodo language NLP research

Applicable Scenarios

Direct Application: Bodo language text processing, information extraction
Research Foundation: Preprocessing step for other Bodo language NLP tasks
Method Transfer: POS tagging tasks for similar low-resource languages
Multilingual Systems: Component of multilingual NLP systems for northeastern India

References

This paper cites abundant related work, primarily including:

BERT-related: Devlin et al. (2018) - Original BERT paper
Sequence Labeling: Huang et al. (2015) - BiLSTM-CRF architecture
Low-Resource Languages: Multiple Indian regional language NLP studies
Language Models: Original papers of various pre-trained models

Overall Assessment: This is a high-quality low-resource language NLP research paper with important contributions in methodological innovation, experimental design, and practical value. While constrained by data scale, it opens new directions for Bodo language NLP research with significant academic and social value.