Part-of-speech tagging for Nagamese Language using CRF

Shohe, Khiamungam, Angami
This paper investigates part-of-speech tagging, an important task in Natural Language Processing (NLP), for the Nagamese language. The Nagamese language, a.k.a. Naga Pidgin, is an Assamese-lexified creole language developed primarily as a means of communication in trade between the Nagas and people from Assam in northeast India. A substantial amount of work in part-of-speech tagging has been done for resource-rich languages like English and Hindi. However, no work has been done in the Nagamese language. To the best of our knowledge, this is the first attempt at part-of-speech tagging for the Nagamese language. The aim of this work is to identify the part of speech of each word in a given Nagamese sentence. An annotated corpus of 16,112 tokens is created, and a machine learning technique known as Conditional Random Fields (CRF) is applied to it. Using CRF, an overall tagging accuracy of 85.70%, precision and recall of 86%, and an F1-score of 85% are achieved.

Keywords: Nagamese, NLP, part-of-speech, machine learning, CRF.

Basic Information

  • Paper ID: 2509.19343
  • Title: Part-of-speech tagging for Nagamese Language using CRF
  • Authors: Alovi N Shohe, Chonglio Khiamungam, Teisovi Angami
  • Institution: Department of Information Technology, Nagaland University, Kohima Campus, India
  • Classification: cs.CL cs.AI
  • Publication Date: October 13, 2025 (arXiv v3)
  • Paper Link: https://arxiv.org/abs/2509.19343

Abstract

This paper investigates part-of-speech (POS) tagging, an important task in natural language processing (NLP), for the Nagamese language. Nagamese, also known as Naga Pidgin, is an Assamese-lexified creole that developed primarily as a medium of trade communication between the Naga people and Assamese speakers in northeastern India. While resource-rich languages such as English and Hindi have extensive POS tagging research, no such work previously existed for Nagamese. To the authors' knowledge, this is the first attempt at POS tagging for the language. The study created an annotated corpus of 16,112 tokens and applied Conditional Random Fields (CRF), a machine learning technique, achieving an overall tagging accuracy of 85.70%, with precision and recall both at 86% and an F1 score of 85%.

Research Background and Motivation

Problem Definition

This research addresses the lack of POS tagging tools for the Nagamese language. POS tagging is a fundamental NLP task involving the assignment of appropriate part-of-speech labels to each word in a sentence.

Significance

  1. Language Preservation: Nagamese serves as the lingua franca of Nagaland, widely used in mass media, news, broadcasting, and government communications
  2. Resource Scarcity: Nagamese is a low-resource language lacking language processing tools and resources
  3. Foundational Application: POS tagging serves as a foundation for constructing other NLP applications such as sentiment analysis and machine translation

Existing Limitations

  • Mainstream NLP tools are primarily developed for resource-rich languages (e.g., English, Hindi)
  • No prior POS tagging research exists for Nagamese
  • Lack of standardized annotated corpora and tagsets

Core Contributions

  1. Pioneering Research: First POS tagging study for the Nagamese language
  2. Tagset Design: Designed 15 part-of-speech tags adapted for Nagamese based on the Penn Treebank tagset
  3. Corpus Construction: Created a manually annotated corpus containing 16,112 tokens
  4. Baseline Model: Established a baseline POS tagging model for Nagamese using CRF technology
  5. Performance Evaluation: Provided detailed error analysis and performance assessment

Methodology

Task Definition

Given a sentence in Nagamese, assign corresponding POS tags to each word.

Input: Word sequence in a Nagamese sentence
Output: Corresponding POS tag sequence
Example:

Itu/ADJECTIVE dikhikena/VERB Isor/NOUN khusi/ADJECTIVE lagise/VERB ./SYM
(God was pleased with what He saw.)
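The word/TAG notation in the example above can be split into parallel word and tag sequences. The following is a minimal sketch, assuming tokens are whitespace-separated and the tag follows the last slash; the function name `parse_tagged` is illustrative and not taken from the paper.

```python
def parse_tagged(line):
    """Split a 'word/TAG word/TAG ...' string into (word, tag) pairs."""
    pairs = []
    for item in line.split():
        # rpartition splits on the LAST slash, so a word containing "/" still parses
        word, _, tag = item.rpartition("/")
        pairs.append((word, tag))
    return pairs


example = "Itu/ADJECTIVE dikhikena/VERB Isor/NOUN khusi/ADJECTIVE lagise/VERB ./SYM"
pairs = parse_tagged(example)
words = [w for w, _ in pairs]   # input sequence
tags = [t for _, t in pairs]    # output sequence
```

Here `words` is the model input and `tags` is the target the tagger has to predict, one label per word.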

Nagamese Language Characteristics

Character Set

  • Vowels: i, u, e, ə, o, a (6 total)
  • Consonants: p, t, c, k, b, d, j, g, ph, th, ch, kh, m, n, ṅ, s, š, h, r, l, w, y (22 total)

Syllable Patterns

  • Monosyllabic: (C)(C)V(C)(C), where V cannot appear alone
  • Bisyllabic: V(C)(C)(C)V(C) or (C)CV(C)(C)CV(C)(C)
  • Trisyllabic: V(C)(C)CV(C)(C)CV(C) or (C)CV(C)(C)V(C)(C)(C)V(C)
  • Tetrasyllabic: (C)V(C)CVCV(C)CV(C)
  • No pentasyllabic words exist (except obvious compounds)
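The syllable templates above are written over consonant (C) and vowel (V) slots. A rough way to check a romanized word against them is to reduce it to its C/V skeleton first. This is a minimal sketch, assuming romanized spelling maps one-to-one onto the phoneme inventory listed above, that the sixth vowel is a schwa (sometimes written `@` in ASCII), and that the four aspirated stops are spelled as two letters; the helper names are hypothetical.

```python
import re

# Assumption: the sixth vowel is schwa, accepted here as either "ə" or "@".
VOWELS = set("iueoaə@")
# Aspirated consonants spelled with two letters count as a single C.
DIGRAPHS = ("ph", "th", "ch", "kh")


def cv_skeleton(word):
    """Map a romanized word to a string of C and V symbols."""
    word = word.lower()
    out = []
    i = 0
    while i < len(word):
        if word[i:i + 2] in DIGRAPHS:
            out.append("C")
            i += 2
        elif word[i] in VOWELS:
            out.append("V")
            i += 1
        else:
            # Everything else is treated as a consonant (non-letters included)
            out.append("C")
            i += 1
    return "".join(out)


# Monosyllabic template (C)(C)V(C)(C); a bare V is excluded separately.
MONOSYLLABIC = re.compile(r"C?C?VC?C?")


def is_monosyllabic_shape(word):
    skeleton = cv_skeleton(word)
    return skeleton != "V" and MONOSYLLABIC.fullmatch(skeleton) is not None
```

For instance, `khusi` from the earlier example reduces to `CVCV`, which fits the bisyllabic template rather than the monosyllabic one.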

Tagset Design

Simplified from Penn Treebank's 36 tags to 15 tags suitable for Nagamese:

No.  Category                   Tag
1    Adjective                  ADJ
2    Adverb                     ADV
3    Conjunction                CONJ
4    Complementizer             CMP
5    Determiner                 DET
6    Postposition/Preposition   PP
7    Interjection               INTJ
8    Noun                       N
9    Pronoun                    PN
10   Quantifier                 QN
11   Verb                       V
12   Foreign Word               FW
13   Symbol                     SYM
14   Unknown                    UNK
15   Numeral                    NUM
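For use in annotation or validation scripts, the tagset can be held as a simple lookup table. This is a sketch only; the constant name `TAGSET` and the helper `is_valid_tag` are illustrative and do not come from the paper.

```python
# The 15 tags from the table above, mapped to their category names.
TAGSET = {
    "ADJ": "Adjective",
    "ADV": "Adverb",
    "CONJ": "Conjunction",
    "CMP": "Complementizer",
    "DET": "Determiner",
    "PP": "Postposition/Preposition",
    "INTJ": "Interjection",
    "N": "Noun",
    "PN": "Pronoun",
    "QN": "Quantifier",
    "V": "Verb",
    "FW": "Foreign Word",
    "SYM": "Symbol",
    "UNK": "Unknown",
    "NUM": "Numeral",
}


def is_valid_tag(tag):
    """Return True if the tag belongs to the 15-tag set."""
    return tag in TAGSET
```

A check like `is_valid_tag` is one way to catch typos in manually annotated data before training.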

Model Architecture

Conditional Random Fields (CRF)

The paper employs a linear-chain CRF, which takes contextual information from adjacent tags in the sequence into account. Because a CRF is normalized over whole tag sequences rather than per state, it avoids the label bias problem inherent in Maximum Entropy Markov Models (MEMMs).
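For reference, the standard textbook form of a linear-chain CRF is given below. The notation is the usual one and is not copied from the paper: x is the word sequence, y the tag sequence, f_k are feature functions over adjacent tags and the input, and λ_k are their learned weights.

```latex
p(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})}
    \exp\!\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, \mathbf{x}, t) \Big),
\qquad
Z(\mathbf{x})
  = \sum_{\mathbf{y}'}
    \exp\!\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, \mathbf{x}, t) \Big)
```

The partition function Z(x) sums over all candidate tag sequences, so scores are normalized once per sentence. An MEMM instead normalizes at every position, which is where label bias comes from.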

Feature Engineering

Designed a rich feature set including:

  • Current word
  • Whether word is at sentence beginning/end
  • Capitalization information
  • Prefixes (length ≤ 3) and suffixes (length ≤ 4)
  • Previous and next words
  • Presence of hyphens
  • Numeric content
  • Presence of uppercase letters within words
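The feature list above can be sketched as a per-token feature function in the dictionary style that CRF toolkits commonly accept. This is an illustrative reconstruction, not the authors' code: the feature names, the `<BOS>`/`<EOS>` placeholders, and the choice to emit every prefix and suffix up to the stated maximum length are assumptions.

```python
def word2features(sent, i):
    """Build a feature dict for the i-th word of a sentence (list of strings)."""
    word = sent[i]
    feats = {
        "word": word,                                        # current word
        "is_first": i == 0,                                  # sentence beginning
        "is_last": i == len(sent) - 1,                       # sentence end
        "is_capitalized": word[:1].isupper(),                # capitalization
        "has_inner_upper": any(c.isupper() for c in word[1:]),  # uppercase inside word
        "has_hyphen": "-" in word,                           # presence of hyphens
        "has_digit": any(c.isdigit() for c in word),         # numeric content
        # Neighbouring words; placeholders at the sentence edges are an assumption
        "prev_word": sent[i - 1] if i > 0 else "<BOS>",
        "next_word": sent[i + 1] if i < len(sent) - 1 else "<EOS>",
    }
    for n in range(1, 4):          # prefixes of length 1..3
        feats[f"prefix{n}"] = word[:n]
    for n in range(1, 5):          # suffixes of length 1..4
        feats[f"suffix{n}"] = word[-n:]
    return feats


def sent2features(sent):
    """Feature dicts for every word in the sentence."""
    return [word2features(sent, i) for i in range(len(sent))]
```

Usage: `sent2features(["Itu", "dikhikena", "Isor", "khusi", "lagise", "."])` returns one feature dictionary per token.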

Optimization Settings

  • Training algorithm: L-BFGS (gradient-based optimization)
  • Iterations: 100
  • Regularization: L1 and L2 regularization to prevent overfitting
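These settings map onto constructor arguments of the `sklearn_crfsuite.CRF` estimator. The sketch below is hedged: the `c1` and `c2` values are placeholders, since the summary does not report the regularization strengths, and the import is deferred so the configuration can be inspected without the third-party package installed.

```python
# Settings from the list above; c1 (L1) and c2 (L2) values are ASSUMED placeholders.
CRF_PARAMS = {
    "algorithm": "lbfgs",     # L-BFGS training
    "c1": 0.1,                # L1 regularization coefficient (assumed value)
    "c2": 0.1,                # L2 regularization coefficient (assumed value)
    "max_iterations": 100,    # iteration cap
}


def build_crf(params=None):
    """Create an untrained CRF with the given settings."""
    import sklearn_crfsuite  # third-party: pip install sklearn-crfsuite
    return sklearn_crfsuite.CRF(**(params or CRF_PARAMS))
```

Training would then be `build_crf().fit(X_train, y_train)`, where `X_train` is a list of sentences, each a list of feature dicts, and `y_train` the matching lists of tags.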

Experimental Setup

Dataset Construction

  1. Data Source: Articles collected from local newspaper "Nagamese Khobor," containing diverse content including current affairs and sports
  2. Corpus Scale: Approximately 26,000 words of raw corpus, of which 16,112 tokens (749 sentences) were manually annotated
  3. Annotation Process: Manual annotation performed by native Nagamese speakers
  4. Quality Verification: A second annotator labeled 1,864 tokens for verification, with disagreement rate of 6.7% including foreign words, and only 1.23% excluding foreign words
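The disagreement rate in the quality check is the share of tokens on which the two annotators chose different tags. A minimal sketch follows; how the paper excluded foreign words is not specified here, so dropping any token that either annotator tagged `FW` is an assumption, and the function name is illustrative.

```python
def disagreement_rate(tags_a, tags_b, ignore=()):
    """Fraction of tokens on which two annotators disagree.

    Tokens where either annotator used a tag listed in `ignore` are skipped.
    """
    if len(tags_a) != len(tags_b):
        raise ValueError("annotations must cover the same tokens")
    kept = [(a, b) for a, b in zip(tags_a, tags_b)
            if a not in ignore and b not in ignore]
    if not kept:
        return 0.0
    return sum(a != b for a, b in kept) / len(kept)


# Toy annotations for illustration only (not data from the paper)
annotator_1 = ["N", "V", "FW", "ADJ"]
annotator_2 = ["N", "V", "N", "ADJ"]
rate_all = disagreement_rate(annotator_1, annotator_2)
rate_no_fw = disagreement_rate(annotator_1, annotator_2, ignore={"FW"})
```

In this toy case the only disagreement involves a foreign word, which mirrors the pattern reported above: most annotator disagreement concerns the FW tag.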

Data Distribution

Label frequency distribution reveals data imbalance:

  • Highest frequency: FW (Foreign Words) - 3,744 occurrences
  • Second: PP (Postpositions) - 2,418 occurrences
  • Lowest frequency: CMP (Complementizer) - 35 occurrences

Evaluation Metrics

  • Accuracy: Overall tagging correctness rate
  • Precision: TP/(TP+FP)
  • Recall: TP/(TP+FN)
  • F1 Score: 2×(Precision×Recall)/(Precision+Recall)
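The formulas above can be computed per tag directly from gold and predicted tag lists. This is a minimal pure-Python sketch with illustrative function names; in practice a library routine would normally be used.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def per_tag_scores(gold, pred, tag):
    """Return (precision, recall, F1) for one tag."""
    pairs = list(zip(gold, pred))
    tp = sum(g == tag and p == tag for g, p in pairs)   # true positives
    fp = sum(g != tag and p == tag for g, p in pairs)   # false positives
    fn = sum(g == tag and p != tag for g, p in pairs)   # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_score(precision, recall)


def accuracy(gold, pred):
    """Share of tokens whose predicted tag matches the gold tag."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)
```

As a check on the formula, a precision of 0.77 with a recall of 0.21 gives an F1 of about 0.33, which shows how a low recall pulls the F1 score down even when precision is fairly high.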

Experimental Configuration

  • Train/test split: 70:30
  • Implementation tool: sklearn-crfsuite library
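A 70:30 split can be sketched as follows. Whether the paper split by sentence or by token, and whether it shuffled first, is not stated in this summary; the sentence-level split, the shuffle, and the fixed seed below are all assumptions.

```python
import random


def split_sentences(sentences, train_fraction=0.7, seed=0):
    """Shuffle sentences reproducibly and split them into train and test parts."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)   # local RNG, does not touch global state
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Stand-in sentence IDs for illustration; real input would be tagged sentences
train, test = split_sentences(range(749))
```

Splitting at the sentence level keeps each sentence intact, which matters for a sequence model that relies on neighbouring words and tags.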

Experimental Results

Main Results

Metric              Value
Overall Accuracy    85.70%
Average Precision   86%
Average Recall      86%
Average F1 Score    85%

Per-Tag Performance Analysis

Best Performance:

  • SYM (Symbol): F1=0.99, Precision=0.99, Recall=0.98
  • NUM (Numeral): F1=0.95, Precision=0.99, Recall=0.92
  • CONJ (Conjunction): F1=0.91, Precision=0.95, Recall=0.87

Weaker Performance:

  • UNK (Unknown): F1=0.33, Precision=0.77, Recall=0.21
  • N (Noun): F1=0.70, Precision=0.70, Recall=0.69
  • ADV (Adverb): F1=0.71, Precision=0.74, Recall=0.69

Error Analysis

Primary error patterns include:

  1. ADJ misclassified as: PP (15 times), V (15 times), N (12 times), FW (11 times)
  2. N misclassified as: FW (76 times), PP (26 times), V (23 times)
  3. FW misclassified as: N (81 times), indicating challenges in foreign word recognition

Transition Pattern Analysis

  • Most likely transition: UNK → UNK
  • Least likely transition: PP → NUM
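Transition rankings of this kind are typically read off the learned tag-to-tag weights. In sklearn-crfsuite a fitted model exposes them as `transition_features_`, a dict keyed by (from_tag, to_tag). The sketch below ranks such a dict; the weights shown are made up for illustration and are not values from the paper.

```python
from collections import Counter


def rank_transitions(transition_weights, n=1):
    """Return the n highest- and n lowest-weighted transitions."""
    ranked = Counter(transition_weights).most_common()  # sorted by weight, descending
    return ranked[:n], ranked[-n:]


# HYPOTHETICAL weights, shaped like a fitted model's transition_features_
weights = {
    ("UNK", "UNK"): 2.5,
    ("N", "V"): 0.8,
    ("PP", "NUM"): -1.7,
}
most_likely, least_likely = rank_transitions(weights)
```

With a trained model, `rank_transitions(crf.transition_features_, n=10)` would list the ten strongest and ten weakest transitions.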

Related Work

Since Nagamese is a creole language with an Assamese lexical base, the paper reviews related work on Assamese POS tagging:

  1. Saharia et al. (2009): HMM-based approach, 172 tags, 10k word training, 87% accuracy
  2. Phukan et al. (2024): Character-level LSTM and Bi-LSTM, 60k words, 93.36% accuracy
  3. Pathak et al. (2023): BiLSTM-CRF architecture, 404k tokens, F1=0.925
  4. Talukdar et al. (2024): RNN and GRU, 30k words, F1=94.56%

These works provide technical references for the current research, though Nagamese as a creole language exhibits unique linguistic characteristics.

Conclusions and Discussion

Main Conclusions

  1. Successfully established the first baseline system for Nagamese POS tagging
  2. CRF model achieved reasonable performance on this task (85.70% accuracy)
  3. The created annotated corpus provides a foundation for subsequent research

Limitations

  1. Tagset Size: Only 15 tags used, potentially insufficient to capture language complexity
  2. Data Scale: 16,112 tokens is relatively small, potentially affecting model generalization
  3. Data Imbalance: Certain tags (e.g., CMP) have extremely limited samples, affecting model learning
  4. Foreign Word Challenge: High frequency and confusion of FW tags indicate foreign word recognition as a major difficulty

Future Directions

  1. Expand Tagset: Add more fine-grained POS tags
  2. Increase Data Volume: Expand the annotated corpus
  3. Application Extension: Apply the POS tagger to sentiment analysis, machine translation, and other applications
  4. Transfer Learning: Explore transfer learning methods from Assamese
  5. Deep Learning: Experiment with modern deep learning methods such as LSTM and BERT

In-Depth Evaluation

Strengths

  1. Pioneering Significance: Fills a gap in Nagamese language NLP research
  2. Linguistic Analysis: Detailed description of Nagamese linguistic features (phonology, syllable structure, etc.)
  3. Annotation Quality: Ensured data quality through dual annotation verification
  4. Error Analysis: Provided detailed confusion matrices and error pattern analysis
  5. Practical Value: Serves as a model for NLP research on low-resource languages

Weaknesses

  1. Methodological Limitations: Only employed traditional CRF methods without attempting modern deep learning techniques
  2. Insufficient Comparison: Lacks comparative experiments with other methods
  3. Data Skew: High proportion of foreign words (23%) may affect practical applicability
  4. Feature Engineering: Relatively simple features that may miss important linguistic characteristics
  5. Evaluation Limitations: Evaluated only on a single dataset, lacking cross-domain validation

Impact

  1. Academic Contribution: Provides important reference for low-resource language NLP research
  2. Social Value: Contributes to digital preservation and development of Nagamese language
  3. Technical Foundation: Establishes basis for constructing more complex Nagamese NLP applications
  4. Methodological Contribution: Demonstrates a complete workflow for building NLP tools for resource-scarce languages

Applicable Scenarios

  1. Educational Applications: Assists Nagamese language teaching and learning
  2. Media Processing: Automates processing of Nagamese news and social media content
  3. Government Services: Supports multilingual government services in Nagaland
  4. Research Foundation: Provides foundational tools for further Nagamese NLP research

References

The paper cites the following key literature:

  1. Sreedhar, M. V. (1985). Standardized grammar of naga pidgin. - Nagamese grammar standardization study
  2. Saharia et al. (2009). Part of speech tagger for assamese text. - Pioneering work in Assamese POS tagging
  3. Pathak et al. (2022, 2023). Deep learning methods for Assamese POS tagging
  4. Phukan et al. (2023, 2024). LSTM-based Assamese POS tagging research

Overall Assessment: This paper holds significant pioneering value. While employing relatively traditional technical methods, it establishes the first POS tagging system for Nagamese, a low-resource language, with important academic and social significance. The research methodology is rigorous, data construction is standardized, and it provides a solid foundation for subsequent research.