
Part-of-speech tagging for Nagamese Language using CRF

Shohe, Khiamungam, Angami

This paper investigates part-of-speech tagging, an important task in Natural Language Processing (NLP), for the Nagamese language. Nagamese, also known as Naga Pidgin, is an Assamese-lexified creole that developed primarily as a means of communication in trade between the Nagas and people from Assam in northeast India. A substantial amount of work on part-of-speech tagging has been done for resource-rich languages such as English and Hindi. However, no such work has been done for Nagamese. To the best of our knowledge, this is the first attempt at part-of-speech tagging for the Nagamese language. The aim of this work is to identify the part of speech of each word in a given Nagamese sentence. An annotated corpus of 16,112 tokens is created, and a machine learning technique known as Conditional Random Fields (CRF) is applied to it. Using CRF, an overall tagging accuracy of 85.70%, precision and recall of 86%, and an F1-score of 85% are achieved.

Keywords: Nagamese, NLP, part-of-speech, machine learning, CRF.

Basic Information

Paper ID: 2509.19343
Title: Part-of-speech tagging for Nagamese Language using CRF
Authors: Alovi N Shohe, Chonglio Khiamungam, Teisovi Angami
Affiliation: Department of Information Technology, Nagaland University, Kohima Campus, India
Categories: cs.CL cs.AI
Published: 13 October 2025 (arXiv v3)
Paper link: https://arxiv.org/abs/2509.19343

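Abstract

This paper studies part-of-speech tagging for the Nagamese language, an important task in Natural Language Processing (NLP). Nagamese, also known as Naga Pidgin, is an Assamese-lexified creole that developed mainly as a means of communication in trade between the Nagas and the Assamese in northeast India. Although a large body of work on part-of-speech tagging exists for resource-rich languages such as English and Hindi, no such work exists for Nagamese. To the best of the authors' knowledge, this is the first attempt at part-of-speech tagging for Nagamese. The study builds an annotated corpus of 16,112 tokens and applies Conditional Random Fields (CRF), a machine learning technique, achieving an overall tagging accuracy of 85.70%, precision and recall of 86%, and an F1-score of 85%.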

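Research Background and Motivation

Problem Definition

This study addresses the lack of part-of-speech tagging tools for Nagamese. Part-of-speech tagging is a foundational NLP task that assigns the appropriate part-of-speech tag to each word in a sentence.

Importance

Language preservation: Nagamese is the lingua franca of Nagaland and is widely used in mass media, news, radio broadcasting, and government media
Resource scarcity: Nagamese is a low-resource language that lacks language processing tools and resources
Foundational application: Part-of-speech tagging is the basis for building other NLP applications, such as sentiment analysis and machine translation

Existing Limitations

Mainstream NLP tools are developed mainly for resource-rich languages such as English and Hindi
No prior work on part-of-speech tagging exists for Nagamese
There is no standardized annotated corpus or tagset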

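Core Contributions

First study of its kind: The first work on part-of-speech tagging for the Nagamese language
Tagset design: A set of 15 part-of-speech tags suited to Nagamese, derived from the Penn Treebank tagset
Corpus construction: A manually annotated corpus of 16,112 tokens
Baseline model: A baseline Nagamese part-of-speech tagger built with CRF
Performance evaluation: A detailed error analysis and performance evaluation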

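Method Details

Task Definition

Given a sentence in Nagamese, assign the corresponding part-of-speech tag to each word.

Input: the sequence of words in a Nagamese sentence
Output: the corresponding sequence of part-of-speech tags
Example: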

Itu/ADJECTIVE dikhikena/VERB Isor/NOUN khusi/ADJECTIVE lagise/VERB ./SYM
(God was pleased with what He saw.)
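
The input/output relationship above can be illustrated with a minimal Python sketch that splits a line in the word/TAG notation of the example into two parallel sequences. The function name `parse_tagged_sentence` is hypothetical and the whitespace-separated word/TAG layout is assumed from the example line only; the paper's actual corpus file format is not stated here.

```python
from typing import List, Tuple


def parse_tagged_sentence(line: str) -> Tuple[List[str], List[str]]:
    """Split 'word/TAG word/TAG ...' into parallel token and tag lists."""
    tokens: List[str] = []
    tags: List[str] = []
    for item in line.split():
        # Partition on the LAST slash so a token containing '/' stays intact.
        token, sep, tag = item.rpartition("/")
        if not sep or not token or not tag:
            raise ValueError(f"expected word/TAG, got {item!r}")
        tokens.append(token)
        tags.append(tag)
    return tokens, tags


# The example sentence from the task definition.
example = "Itu/ADJECTIVE dikhikena/VERB Isor/NOUN khusi/ADJECTIVE lagise/VERB ./SYM"
tokens, tags = parse_tagged_sentence(example)
```

The two lists have the same length by construction, which is the shape a sequence labeller such as a CRF expects: one tag per token, in order.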

Characteristics of the Nagamese Language

Character Set

Vowels: i, u, e, ə, o, a (6)
Consonants: p, t, c, k, b, d, j, g, ph, th, ch, kh, m, n, ṅ, s, š, h, r, l, w, y (22)

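Syllable Patterns

Monosyllabic: (C)(C)V(C)(C), but V cannot occur on its own
Disyllabic: V(C)(C)(C)V(C) or (C)CV(C)(C)CV(C)(C)
Trisyllabic: V(C)(C)CV(C)(C)CV(C) or (C)CV(C)(C)V(C)(C)(C)V(C)
Tetrasyllabic: (C)V(C)CVCV(C)CV(C)
No five-syllable words (apart from obvious compounds)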
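
To make the C/V templates concrete, the following Python sketch reduces a word to its consonant/vowel skeleton using the inventory listed above and checks it against the two disyllabic templates. It is an illustration only: the helper names are hypothetical, it assumes the written form maps one-to-one onto the listed phonemes (with ph, th, ch, kh read as single consonants, both @ and ə accepted for the sixth vowel, and the liquid written as l), and it is not part of the paper's method.

```python
import re
import unicodedata

# Inventory from the character-set list; escapes avoid encoding ambiguity.
VOWELS = {"i", "u", "e", "\u0259", "@", "o", "a"}  # \u0259 is the schwa
CONSONANTS = {
    "p", "t", "c", "k", "b", "d", "j", "g", "ph", "th", "ch", "kh",
    "m", "n", "\u1e45", "s", "\u0161", "h", "r", "l", "w", "y",
}
DIGRAPHS = {c for c in CONSONANTS if len(c) == 2}

# The two disyllabic templates: V(C)(C)(C)V(C) or (C)CV(C)(C)CV(C)(C).
DISYLLABIC = re.compile(r"VC{0,3}VC?|C{1,2}VC{0,2}CVC{0,2}")


def cv_skeleton(word: str) -> str:
    """Map a word to a string of C and V symbols ('?' for unknown characters)."""
    text = unicodedata.normalize("NFC", word).lower()
    out = []
    i = 0
    while i < len(text):
        if text[i:i + 2] in DIGRAPHS:  # try two-letter consonants first
            out.append("C")
            i += 2
        elif text[i] in VOWELS:
            out.append("V")
            i += 1
        elif text[i] in CONSONANTS:
            out.append("C")
            i += 1
        else:
            out.append("?")
            i += 1
    return "".join(out)


def has_disyllabic_shape(word: str) -> bool:
    """True if the word's skeleton fits one of the disyllabic templates."""
    return DISYLLABIC.fullmatch(cv_skeleton(word)) is not None
```

Under these assumptions, khusi reduces to CVCV and Isor to VCVC, both of which fit a disyllabic template, while dikhikena reduces to CVCVCVCV, which fits the tetrasyllabic template instead.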

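Tagset Design

The 36 Penn Treebank tags were reduced to 15 tags suited to Nagamese:

No.	Category	Tag
1	Adjective	ADJ
2	Adverb	ADV
3	Conjunction	CONJ
4	Complementizer	CMP
5	Determiner	DET
6	Postposition/Preposition	PP
7	Interjection	INTJ
8	Noun	N
9	Pronoun	PN
10	Quantifier	QN
11	Verb	V
12	Foreign word	FW
13	Symbol	SYM
14	Unknown	UNK
15	Numeral	NUM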
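
The tagset can be held in code as a simple lookup table. The sketch below is a hypothetical helper, not something taken from the paper: it maps the category names to the 15 abbreviated tags and normalises a label to its abbreviation, which is useful because the worked example in the task definition spells labels out (VERB, NOUN) while the table abbreviates them (V, N). The English category names are this summary's renderings and are an assumption.

```python
# Category name (upper-cased) -> abbreviated tag, following the table above.
TAGSET = {
    "ADJECTIVE": "ADJ",
    "ADVERB": "ADV",
    "CONJUNCTION": "CONJ",
    "COMPLEMENTIZER": "CMP",
    "DETERMINER": "DET",
    "POSTPOSITION/PREPOSITION": "PP",
    "INTERJECTION": "INTJ",
    "NOUN": "N",
    "PRONOUN": "PN",
    "QUANTIFIER": "QN",
    "VERB": "V",
    "FOREIGN WORD": "FW",
    "SYMBOL": "SYM",
    "UNKNOWN": "UNK",
    "NUMERAL": "NUM",
}
VALID_TAGS = frozenset(TAGSET.values())


def normalize_tag(label: str) -> str:
    """Return the abbreviated tag for a category name or an existing tag."""
    key = label.strip().upper()
    if key in VALID_TAGS:  # already an abbreviation such as 'SYM'
        return key
    if key in TAGSET:  # spelled-out category such as 'VERB'
        return TAGSET[key]
    raise ValueError(f"label {label!r} is not in the 15-tag set")
```

Raising an error on an unrecognised label, rather than silently mapping it to UNK, is a design choice made here so that annotation typos surface early; UNK stays reserved for words the annotators marked as unknown.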