Part-of-speech tagging for Nagamese Language using CRF
Shohe, Khiamungam, Angami
This paper investigates part-of-speech tagging, an important task in Natural Language Processing (NLP), for the Nagamese language. The Nagamese language, a.k.a. Naga Pidgin, is an Assamese-lexified creole language that developed primarily as a means of communication in trade between the Nagas and people from Assam in northeast India. A substantial amount of work on part-of-speech tagging has been done for resource-rich languages like English and Hindi. However, no such work has been done for the Nagamese language. To the best of our knowledge, this is the first attempt at part-of-speech tagging for the Nagamese language. The aim of this work is to identify the part of speech of each word in a given Nagamese sentence. An annotated corpus of 16,112 tokens is created, and a machine learning technique known as Conditional Random Fields (CRF) is applied to it. Using CRF, an overall tagging accuracy of 85.70% is achieved, with precision and recall of 86% and an F1-score of 85%.
Keywords. Nagamese, NLP, part-of-speech, machine learning, CRF.
This paper investigates part-of-speech (POS) tagging, an important task in natural language processing (NLP), for the Nagamese language. Nagamese, also known as Naga Pidgin, is an Assamese-lexified creole that developed primarily as a medium of communication for trade between the Naga people and Assamese speakers in northeastern India. While POS tagging has been studied extensively for resource-rich languages such as English and Hindi, no such work previously existed for Nagamese. To the authors' knowledge, this is the first attempt at POS tagging for the language. The study created an annotated corpus of 16,112 tokens and applied Conditional Random Fields (CRF), a machine learning technique, achieving an overall tagging accuracy of 85.70%, with precision and recall both at 86% and an F1 score of 85%.
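The reported figures can be related to each other with a small sketch. The summary does not say how precision, recall, and F1 were averaged across tags; the code below assumes support-weighted averaging, which is only one plausible reading. The function name `tagging_report` and the toy tags are hypothetical and are not taken from the paper.

```python
from collections import Counter
from typing import Dict, List


def tagging_report(gold: List[str], pred: List[str]) -> Dict[str, float]:
    """Token accuracy plus support-weighted precision, recall and F1.

    Assumption: per-tag scores are averaged with weights proportional to
    how often each tag occurs in the gold annotation.
    """
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and equally long")

    correct = Counter(g for g, p in zip(gold, pred) if g == p)
    gold_counts, pred_counts = Counter(gold), Counter(pred)
    total = len(gold)

    precision = recall = f1 = 0.0
    for tag, support in gold_counts.items():
        p = correct[tag] / pred_counts[tag] if pred_counts[tag] else 0.0
        r = correct[tag] / support
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = support / total
        precision += weight * p
        recall += weight * r
        f1 += weight * f

    return {
        "accuracy": sum(correct.values()) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy data with hypothetical tags, for illustration only.
gold_tags = ["NN", "NN", "VB", "JJ"]
pred_tags = ["NN", "VB", "VB", "JJ"]
report = tagging_report(gold_tags, pred_tags)
```

Under this averaging scheme, weighted recall always equals token accuracy, which is at least consistent with the reported 85.70% accuracy and 86% recall.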
This research addresses the lack of POS tagging tools for the Nagamese language. POS tagging is a fundamental NLP task involving the assignment of appropriate part-of-speech labels to each word in a sentence.
Language Preservation: Nagamese serves as the lingua franca of Nagaland, widely used in mass media, news, broadcasting, and government communications
Resource Scarcity: Nagamese is a low-resource language lacking language processing tools and resources
Foundational Application: POS tagging serves as a foundation for constructing other NLP applications such as sentiment analysis and machine translation
The study employs a linear-chain CRF model, which takes contextual information from adjacent tags in the sequence into account and avoids the label bias problem inherent in Maximum Entropy Markov Models (MEMM).
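How adjacent tags influence the result can be sketched with Viterbi decoding over a linear-chain model. This is a minimal illustration, not the authors' implementation: the tag names, the scores, and the function name `viterbi` are all hypothetical, and in a trained CRF the scores would come from learned feature weights.

```python
from typing import Dict, List, Tuple


def viterbi(
    emissions: List[Dict[str, float]],
    transitions: Dict[Tuple[str, str], float],
    tags: List[str],
) -> List[str]:
    """Return the highest-scoring tag sequence under a linear-chain model.

    emissions[i][t] is the score of tag t at position i; transitions[(p, t)]
    is the score of tag t directly following tag p. Missing entries score 0.0.
    """
    if not emissions:
        return []

    # best[i][t]: best score of any tag path that ends in tag t at position i.
    best = [{t: emissions[0].get(t, 0.0) for t in tags}]
    back: List[Dict[str, str]] = []

    for i in range(1, len(emissions)):
        scores: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for t in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transitions.get((p, t), 0.0))
            scores[t] = (
                best[-1][prev]
                + transitions.get((prev, t), 0.0)
                + emissions[i].get(t, 0.0)
            )
            pointers[t] = prev
        best.append(scores)
        back.append(pointers)

    # Recover the best path by following back-pointers from the last position.
    path = [max(tags, key=lambda t: best[-1][t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical two-token example: the second token looks slightly more like
# a verb in isolation, but a determiner before it favours the noun reading.
TAGS = ["DET", "NOUN", "VERB"]
emission_scores = [{"DET": 2.0}, {"VERB": 1.0, "NOUN": 0.8}]
transition_scores = {("DET", "NOUN"): 1.0, ("DET", "VERB"): -1.0}

with_context = viterbi(emission_scores, transition_scores, TAGS)
without_context = viterbi(emission_scores, {}, TAGS)
```

With the transition scores the decoder returns DET followed by NOUN; with no transition scores it falls back to the per-token preference, DET followed by VERB. The whole sequence is scored jointly rather than normalising each step locally, which is the property that lets a CRF avoid label bias.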
Data Source: Articles collected from the local newspaper "Nagamese Khobor," covering diverse content including current affairs and sports
Corpus Scale: Approximately 26,000 words of raw corpus, of which 16,112 tokens (749 sentences) were manually annotated
Annotation Process: Manual annotation performed by native Nagamese speakers
Quality Verification: A second annotator labeled 1,864 tokens for verification; the disagreement rate was 6.7% when foreign words were included and only 1.23% when they were excluded
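The two disagreement rates can be computed along the following lines. This is a sketch under stated assumptions: the summary does not describe exactly how foreign words were excluded, so the code assumes they carry a dedicated tag (here the hypothetical tag "FW") and drops any token that either annotator marked with it. The function name and the toy annotations are illustrative only.

```python
from typing import List, Optional, Set


def disagreement_rate(
    first: List[str],
    second: List[str],
    ignore: Optional[Set[str]] = None,
) -> float:
    """Fraction of tokens on which two annotators assigned different tags.

    Tokens where either annotator used a tag in `ignore` are left out of
    both the numerator and the denominator.
    """
    if len(first) != len(second):
        raise ValueError("both annotations must cover the same tokens")

    ignore = ignore or set()
    pairs = [
        (a, b)
        for a, b in zip(first, second)
        if a not in ignore and b not in ignore
    ]
    if not pairs:
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)


# Toy annotations with hypothetical tags; "FW" stands in for foreign words.
annotator_1 = ["NN", "VB", "FW", "NN", "JJ"]
annotator_2 = ["NN", "VB", "NN", "NN", "RB"]

rate_all = disagreement_rate(annotator_1, annotator_2)
rate_no_foreign = disagreement_rate(annotator_1, annotator_2, ignore={"FW"})
```

In the toy data the rate drops from 0.4 to 0.25 once the foreign-word token is excluded, mirroring the direction of the reported drop from 6.7% to 1.23%.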
Sreedhar, M. V. (1985). Standardized Grammar of Naga Pidgin. A study standardizing Nagamese grammar
Saharia et al. (2009). Part of Speech Tagger for Assamese Text. Pioneering work in Assamese POS tagging
Pathak et al. (2022, 2023). Deep learning methods for Assamese POS tagging
Phukan et al. (2023, 2024). LSTM-based Assamese POS tagging research
Overall Assessment: This paper holds significant pioneering value. While employing relatively traditional technical methods, it establishes the first POS tagging system for Nagamese, a low-resource language, with important academic and social significance. The research methodology is rigorous, data construction is standardized, and it provides a solid foundation for subsequent research.