Part-of-Speech Tagger for Bodo Language using Deep Learning approach

Pathak, Narzary, Nandi et al.

Language Processing systems such as Part-of-speech tagging, Named entity recognition, Machine translation, Speech recognition, and Language modeling (LM) are well-studied in high-resource languages. Nevertheless, research on these systems for several low-resource languages, including Bodo, Mizo, Nagamese, and others, is either yet to commence or is in its nascent stages. Language models play a vital role in the downstream tasks of modern NLP. Extensive studies have been carried out on LMs for high-resource languages. Nevertheless, languages such as Bodo, Rabha, and Mising continue to lack coverage. In this study, we first present BodoBERT, a language model for the Bodo language. To the best of our knowledge, this work is the first such effort to develop a language model for Bodo. Secondly, we present an ensemble DL-based POS tagging model for Bodo. The POS tagging model is based on combinations of BiLSTM with CRF and stacked embedding of BodoBERT with BytePairEmbeddings. We cover several language models in the experiment to see how well they work in POS tagging tasks. The best-performing model achieves an F1 score of 0.8041. A comparative experiment was also conducted on Assamese POS taggers, considering that the language is spoken in the same region as Bodo.

Basic information

Paper ID: 2401.03175
Title: Part-of-Speech Tagger for Bodo Language using Deep Learning approach
Authors: Dhrubajyoti Pathak, Sanjib Narzary, Sukumar Nandi, Bidisha Som
Affiliation: Centre for Linguistic Science and Technology, IIT Guwahati
Categories: cs.CL cs.AI cs.LG
Journal: Natural Language Engineering (accepted)
Paper link: https://arxiv.org/abs/2401.03175

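Core problem: Bodo is a major language of Northeast India (1.5 million speakers, the 20th most spoken language in India), yet it lacks basic NLP tools and resources.
Technical challenges:
- No pretrained language model covers Bodo
- Annotated data is scarce (an annotated corpus of only about 30k sentences)
- The language is linguistically complex (Tibeto-Burman, morphologically rich)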

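Significance

Language status: Bodo is one of the 22 scheduled languages of India and the official language of the Bodoland Territorial Region
Practical need: its 1.5 million speakers need NLP tools in their own language
Academic value: fills a gap in NLP research on low-resource languages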

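Existing limitations

Basic NLP tasks (lexical analysis, dependency parsing, language identification, and so on) have not yet been addressed
No pretrained language model is available
No deep-learning-based downstream NLP tools exist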

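Core contributions

First Bodo language model: BodoBERT, built on the BERT architecture, is the first pretrained language model trained specifically for Bodo
POS tagger comparison across architectures: a systematic comparison of three sequence-labeling architectures (CRF, fine-tuning, and BiLSTM-CRF)
Language model performance analysis: evaluates FastText, BPE, XLM-R, FlairEmbedding, IndicBERT, MuRIL, and other language models on the Bodo POS tagging task
Stacked embeddings: proposes two embedding methods, Individual and Stacked; stacking raises the best F1 score from 0.7949 (BodoBERT alone) to 0.8041 (BodoBERT + BytePair)
Open resources: the best POS tagging model and the BodoBERT model are publicly released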

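Data sources (BodoBERT pretraining corpus):
- Linguistic Data Consortium for Indian Languages (LDC-IL)
- The work of Narzary et al. (2022)
Corpus size: 1.6M tokens, 191k sentences
Domain coverage: aesthetics, commerce, mass media, science and technology, social sciences, and other domains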

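Model architecture

Base architecture: multi-layer bidirectional Transformer (based on the BERT framework)
Key parameters:
- 6 Transformer blocks
- Hidden size: 768
- Self-attention heads: 6
- Total parameters: about 103M
- Vocabulary size: 50,000 (WordPiece tokenizer)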

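Training setup

Hardware: Nvidia Tesla P100 GPU
Training steps: 300K steps
Sequence length: 128
Batch size: 64
Optimizer: Adam (learning rate 2e-5, warm-up over the first 3,000 steps)
Training time: about 7 days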
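
The warm-up setting above can be sketched as a learning-rate schedule. This is a minimal illustration, assuming linear warm-up to the peak rate over the first 3,000 steps followed by linear decay to zero at step 300,000. The decay shape is an assumption (it is the usual BERT recipe); the summary states only the peak rate and the warm-up length. The function name `learning_rate_at` is hypothetical.

```python
PEAK_LR = 2e-5          # peak learning rate stated above
WARMUP_STEPS = 3_000    # warm-up length stated above
TOTAL_STEPS = 300_000   # total training steps stated above


def learning_rate_at(step: int) -> float:
    """Return the learning rate at a training step.

    Linear warm-up is followed by linear decay to zero; the decay
    shape is an assumption, not something the summary specifies.
    """
    if not 0 <= step <= TOTAL_STEPS:
        raise ValueError("step must lie in [0, TOTAL_STEPS]")
    if step < WARMUP_STEPS:
        # Ramp up proportionally from 0 to PEAK_LR.
        return PEAK_LR * step / WARMUP_STEPS
    # Decay proportionally from PEAK_LR down to 0 at TOTAL_STEPS.
    return PEAK_LR * (TOTAL_STEPS - step) / (TOTAL_STEPS - WARMUP_STEPS)


if __name__ == "__main__":
    for step in (0, 1_500, 3_000, 150_000, 300_000):
        print(step, learning_rate_at(step))
```

Warm-up keeps the first updates small while the Adam moment estimates are still unreliable, which is why it is commonly paired with a small peak rate like the one reported here.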

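POS tagging model architecture

Three sequence-labeling methods

CRF model: BodoBERT embeddings + CRF layer
Fine-tuning model: BodoBERT fine-tuned directly for POS tagging
BiLSTM-CRF model: BodoBERT embeddings + BiLSTM + CRF layer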
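
The CRF layer in the first and third models chooses the best tag sequence as a whole rather than the best tag for each token in isolation. A minimal sketch of that decoding step (Viterbi search) follows, assuming per-token emission scores from the encoder and a learned tag-to-tag transition table. The function name and all scores are hypothetical and do not come from the paper; the two tags are BIS-style labels used only for illustration.

```python
from typing import Dict, List


def viterbi_decode(emissions: List[Dict[str, float]],
                   transitions: Dict[str, Dict[str, float]]) -> List[str]:
    """Return the highest-scoring tag sequence.

    emissions[t][tag] is the score of `tag` at position t;
    transitions[prev][tag] is the score of moving from `prev` to `tag`.
    """
    tags = list(emissions[0])
    # best[t][tag]: best score of any path that ends in `tag` at position t.
    best = [dict(emissions[0])]
    backpointers: List[Dict[str, str]] = []
    for scores in emissions[1:]:
        prev = best[-1]
        current: Dict[str, float] = {}
        pointer: Dict[str, str] = {}
        for tag in tags:
            best_prev = max(tags, key=lambda p: prev[p] + transitions[p][tag])
            current[tag] = prev[best_prev] + transitions[best_prev][tag] + scores[tag]
            pointer[tag] = best_prev
        best.append(current)
        backpointers.append(pointer)
    # Walk back from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    return path[::-1]


# Toy scores (hypothetical): position 1 slightly prefers V_VM on its own,
# but the V_VM -> V_VM transition is heavily penalized.
EMISSIONS = [
    {"N_NN": 2.0, "V_VM": 1.0},
    {"N_NN": 1.0, "V_VM": 1.2},
    {"N_NN": 0.5, "V_VM": 2.0},
]
TRANSITIONS = {
    "N_NN": {"N_NN": 0.5, "V_VM": 0.0},
    "V_VM": {"N_NN": 0.0, "V_VM": -2.0},
}

if __name__ == "__main__":
    print(viterbi_decode(EMISSIONS, TRANSITIONS))  # ['N_NN', 'N_NN', 'V_VM']
```

Picking the top tag per token would give N_NN, V_VM, V_VM here; the transition scores steer the decoder to N_NN, N_NN, V_VM instead.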

Embedding methods

Individual method: each language model is used on its own
Stacked method: BodoBERT is stacked with one other language model
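
In the Flair framework, stacking embeddings means concatenating the vectors that each component model produces for a token. The sketch below shows only that concatenation step, assuming each embedder is a function from a token to a list of floats. The two toy embedders are hypothetical stand-ins, not BodoBERT or BytePairEmbeddings.

```python
from typing import Callable, List, Sequence

Embedder = Callable[[str], List[float]]


def stack_embeddings(token: str, embedders: Sequence[Embedder]) -> List[float]:
    """Concatenate the vectors produced by each embedder for one token."""
    stacked: List[float] = []
    for embed in embedders:
        stacked.extend(embed(token))
    return stacked


def toy_contextual(token: str) -> List[float]:
    # Hypothetical 3-dimensional stand-in for a contextual model.
    return [float(len(token)), 1.0, 0.0]


def toy_subword(token: str) -> List[float]:
    # Hypothetical 2-dimensional stand-in for a subword model.
    return [float(token.count("a")), 0.5]


if __name__ == "__main__":
    vector = stack_embeddings("banana", [toy_contextual, toy_subword])
    print(len(vector), vector)  # 5 [6.0, 1.0, 0.0, 3.0, 0.5]
```

The stacked vector's dimensionality is the sum of the component dimensionalities, so the downstream BiLSTM sees both views of the token at once.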

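Technical innovations

Language adaptation: the first dedicated language model designed around the characteristics of Bodo
Multi-model fusion: a systematic comparison and fusion of several pretrained models
Cross-lingual transfer: knowledge transfer from Hindi models, which share the same script (Devanagari)
Stacking strategy: combines the dedicated language model with general-purpose models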

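Experimental setup

Dataset

Annotated corpus: Bodo Monolingual Text Corpus (ILCI-II)
Data size:
- Training set: 24,003 sentences, 192k tokens
- Validation set: 2,325 sentences, 23k tokens
- Test set: 3,161 sentences, 23k tokens
Tagset: BIS tagset, 11 top-level categories, 34 fine-grained tags
Data format: CoNLL-2003 format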
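
A CoNLL-style file puts one token per line with whitespace-separated columns and separates sentences with a blank line. A minimal reader is sketched below, assuming the token is in the first column and the POS tag in the last. That column layout is an assumption, since the summary does not describe it. The sample uses placeholder tokens (`tok1`, `tok2`, ...) with BIS-style tags, not real corpus data.

```python
from typing import List, Tuple

Sentence = List[Tuple[str, str]]


def read_conll(text: str) -> List[Sentence]:
    """Parse CoNLL-style text into sentences of (token, tag) pairs."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("-DOCSTART-"):
            # Document boundary marker used in CoNLL-2003 files.
            continue
        columns = line.split()
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


# Placeholder tokens; the tags are illustrative BIS-style labels.
SAMPLE = "tok1 N_NN\ntok2 V_VM\n. RD_PUNC\n\ntok3 PR_PRP\ntok4 V_VM\n"

if __name__ == "__main__":
    for sentence in read_conll(SAMPLE):
        print(sentence)
```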

Evaluation metrics

Primary metric: F1-score (Micro)
Secondary metrics: F1-score (Weighted), Precision, Recall
Tag-level analysis: detailed performance for each POS tag
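
The difference between the two F1 variants can be made concrete with a small sketch. It uses the standard definitions: micro F1 pools true positives, false positives, and false negatives over all tags, while weighted F1 averages the per-tag F1 scores weighted by how often each tag occurs in the gold data. The paper's exact evaluation code is not described in this summary, so the function below is an illustration, and its name and the toy tag sequences are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def f1_scores(gold: List[str], pred: List[str]) -> Dict[str, float]:
    """Compute micro and support-weighted F1 for single-label tagging."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and equally long")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted tag is a false positive
            fn[g] += 1  # gold tag is a false negative

    def f1(t: int, f_pos: int, f_neg: int) -> float:
        denom = 2 * t + f_pos + f_neg
        return 2 * t / denom if denom else 0.0

    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    support = Counter(gold)
    weighted = sum(
        count * f1(tp[tag], fp[tag], fn[tag]) for tag, count in support.items()
    ) / len(gold)
    return {"micro": micro, "weighted": weighted}


if __name__ == "__main__":
    gold = ["N_NN", "N_NN", "N_NN", "V_VM"]
    pred = ["N_NN", "N_NN", "V_VM", "V_VM"]
    print(f1_scores(gold, pred))
```

When every token receives exactly one tag, micro F1 equals token-level accuracy, so the two columns in the results can diverge only through the per-tag weighting.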

Methods compared

Language models compared

Model	Training corpus	Data size
FastText	Wiki	<29M
BytePair	Wiki	29M
BodoBERT	Bodo corpus	1.6M
FlairEmbeddings	Wiki+OPUS	≈29M
MuRIL	CommonCrawl+Wiki	788M
XLM-R	CC-100	1.7B
IndicBERT	Scraping	1.84B

Architectures compared

CRF vs Fine-tuning vs BiLSTM-CRF
Individual vs Stacked embedding methods

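Implementation details

Framework: Flair framework
Batch size: 32
Early stopping: training stops when validation performance no longer improves
Learning rate schedule: Learning Rate Annealing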
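
One common way to combine learning rate annealing with early stopping is to shrink the rate whenever the validation score stalls and to stop once the rate falls below a floor. The sketch below simulates that rule on a list of validation scores. The initial rate, anneal factor, patience, and floor are hypothetical values chosen for illustration; the summary does not report the ones used in the paper.

```python
from typing import List


def anneal_schedule(dev_scores: List[float],
                    initial_lr: float = 0.1,
                    anneal_factor: float = 0.5,
                    patience: int = 2,
                    min_lr: float = 0.04) -> List[float]:
    """Return the learning rate used in each epoch that actually runs.

    The rate is multiplied by `anneal_factor` after `patience` consecutive
    epochs without a new best validation score; training stops as soon as
    the rate drops below `min_lr`. All default values are hypothetical.
    """
    lr = initial_lr
    best = float("-inf")
    bad_epochs = 0
    used: List[float] = []
    for score in dev_scores:
        used.append(lr)
        if score > best:
            best = score
            bad_epochs = 0
            continue
        bad_epochs += 1
        if bad_epochs >= patience:
            lr *= anneal_factor
            bad_epochs = 0
            if lr < min_lr:
                break  # early stop: the rate is too small to be useful
    return used


if __name__ == "__main__":
    scores = [0.70, 0.75, 0.74, 0.74, 0.76, 0.75, 0.75, 0.80]
    rates = anneal_schedule(scores)
    print(len(rates), rates)
```

With these toy scores the rate is halved after epoch 4, and training stops after epoch 7, so the eighth score is never reached.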

Results by tagging architecture

Embedding	Tagging model	F1-score (Micro)	F1-score (Weighted)
BodoBERT	CRF	0.7583	0.7454
BodoBERT	Fine-tuned BERT	0.7754	0.7775
BodoBERT	BiLSTM + CRF	0.7949	0.7898

Language model comparison (Individual method)

Embedding model	Bodo F1	Assamese F1
FastText	0.7686	0.6981
BytePair	0.7669	0.7099
BodoBERT	0.7949	0.7033
FlairEmbeddings	0.7885	0.7076
MuRIL	0.7708	0.7286
XLM-R	0.7638	0.7001
IndicBERT	0.7235	0.7293

Stacked method results

Stacked embedding combination	F1 score
BodoBERT + FastText	0.7928
BodoBERT + BytePair	0.8041
BodoBERT + mBERT	0.799
BodoBERT + FlairEmbeddings	0.801
BodoBERT + MuRIL	0.785
BodoBERT + XLM-R	0.8003
BodoBERT + IndicBERT	0.793