2025-11-18T17:28:20.387006

Fine-Tuning Large Language Models with QLoRA for Offensive Language Detection in Roman Urdu-English Code-Mixed Text

Hussain, Qasim, Mehak et al.

The use of derogatory terms in languages that employ code mixing, such as Roman Urdu, presents challenges for Natural Language Processing systems due to unstated grammar, inconsistent spelling, and a scarcity of labeled data. In this work, we propose a QLoRA based fine tuning framework to improve offensive language detection in Roman Urdu-English text. We translated the Roman Urdu-English code mixed dataset into English using Google Translate to leverage English LLMs, while acknowledging that this translation reduces direct engagement with code mixing features. Our focus is on classification performance using English translated low resource inputs. We fine tuned several transformers and large language models, including Meta LLaMA 3 8B, Mistral 7B v0.1, LLaMA 2 7B, ModernBERT, and RoBERTa, with QLoRA for memory efficient adaptation. Models were trained and evaluated on a manually annotated Roman Urdu dataset for offensive vs non offensive content. Of all tested models, the highest F1 score of 91.45 was attained by Meta LLaMA 3 8B, followed by Mistral 7B at 89.66, surpassing traditional transformer baselines. These results demonstrate the efficacy of QLoRA in fine tuning high performing models for low resource environments such as code mixed offensive language detection, and confirm the potential of LLMs for this task. This work advances a scalable approach to Roman Urdu moderation and paves the way for future multilingual offensive detection systems based on LLMs.

Basic Information

Paper ID: 2510.03683
Title: Fine-Tuning Large Language Models with QLoRA for Offensive Language Detection in Roman Urdu-English Code-Mixed Text
Authors: Nisar Hussain, Amna Qasim, Gull Mehak, Muhammad Usman, Muhammad Zain, Momina Hafeez, Grigori Sidorov
Institution: Instituto Politécnico Nacional (IPN), Centro de Investigación en Computación (CIC), Mexico
Classification: cs.CL (Computational Linguistics)
Paper Link: https://arxiv.org/abs/2510.03683

Abstract

This study addresses offensive language detection in Roman Urdu-English code-mixed text by proposing a fine-tuning framework for large language models using QLoRA. Given the challenges of Roman Urdu including grammatical irregularities, spelling inconsistencies, and scarcity of annotated data, the researchers employ Google Translate to convert code-mixed text into English to fully leverage the capabilities of English-based large language models. Experiments are conducted on multiple models including Meta-LLaMA-3-8B, Mistral-7B-v0.1, LLaMA 2-7B, ModernBERT, and RoBERTa. Results demonstrate that Meta-LLaMA-3-8B achieves the highest F1 score of 91.45%, while Mistral-7B reaches 89.66%, both surpassing traditional Transformer baseline models.

Research Background and Motivation

Problem Definition

The core problem addressed in this research is offensive language detection in Roman Urdu-English code-mixed text. Roman Urdu is the primary form of digital communication in Pakistan and parts of India, where users write Urdu using Latin characters and frequently mix English vocabulary.

Problem Significance

Social Media Safety Requirements: With the proliferation of platforms such as Twitter, Facebook, and YouTube, the dissemination of offensive and harmful content has become increasingly severe. Identifying and reducing such content is crucial for maintaining digital health and preventing psychological harm to users.
Unique Challenges of Code-Mixed Languages: Roman Urdu-English code-mixed text exhibits non-standard grammar, spelling inconsistencies, and lack of annotated datasets. These characteristics significantly reduce the accuracy of traditional NLP models.

Limitations of Existing Methods

Traditional Machine Learning Approaches: Early methods using SVM, Naive Bayes, and logistic regression combined with TF-IDF or n-gram features demonstrate poor generalization across different contexts and languages, particularly on informal, noisy, or code-mixed data.
Deep Learning Models: While CNNs and RNNs outperform traditional methods in capturing contextual information, they still face challenges with morphologically rich, low-resource languages such as Roman Urdu.
Scarcity of Pre-trained Models: The absence of specialized pre-trained models or large-scale annotated corpora for Roman Urdu limits the application of existing methods.

Core Contributions

Proposes an end-to-end Roman Urdu-English offensive language detection pipeline: Constructs a complete processing workflow from data preprocessing to model evaluation.
Applies QLoRA to LLaMA and Mistral models: First application of Quantized Low-Rank Adaptation technology to Roman Urdu offensive language detection tasks.
Conducts comprehensive comparative evaluation: Compares the performance of QLoRA fine-tuned large language models with traditionally fine-tuned ModernBERT and RoBERTa models.
Adopts translation-based preprocessing strategy: Leverages English large language models to process low-resource code-mixed text through translation methods.

Methodology Details

Task Definition

Input: Roman Urdu-English code-mixed text
Output: Binary classification labels (offensive/non-offensive)
Constraints: Handling low-resource, non-standard grammar, and code-mixed characteristics

Model Architecture

Overall Pipeline

The research employs a systematic processing pipeline:

Data Collection and Preprocessing
- Dataset contains 46,026 samples (24,026 "offensive", 22,000 "non-offensive")
- Primarily scraped from public Facebook comments and YouTube replies
- Manually annotated by three bilingual annotators with Cohen's Kappa agreement of 0.86
Translation Processing
- Uses GoogleTranslator library from the deep_translator package
- Translates Roman Urdu text to English to leverage English LLMs
- Preserves original code-mixed characteristics until the translation stage
Dataset Partitioning and Annotation
- Label mapping: "offensive" → 1, "non-offensive" → 0
- Uses stratified sampling for 80% training and 20% test split
- For decoder models, input is formatted in prompt style
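
The label mapping, stratified 80/20 split, and prompt-style formatting listed above can be sketched as follows. This is a minimal standard-library illustration, not the authors' code: it assumes the text has already been translated to English, and the `stratified_split` helper, the seed, and the prompt wording in `build_prompt` are hypothetical, since the summary does not give them.

```python
import random
from collections import defaultdict

# Label mapping as stated in the summary.
LABEL_MAP = {"offensive": 1, "non-offensive": 0}

def stratified_split(rows, test_fraction=0.2, seed=42):
    """Split (text, label) rows so each class keeps the same train/test ratio."""
    by_label = defaultdict(list)
    for text, label in rows:
        by_label[label].append((text, LABEL_MAP[label]))
    rng = random.Random(seed)  # seed is an assumption, chosen for reproducibility
    train, test = [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        n_test = round(len(items) * test_fraction)
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test

def build_prompt(text):
    """Hypothetical prompt template for decoder models; the paper's wording is not given."""
    return (
        "Classify the following comment as offensive or non-offensive.\n"
        f"Comment: {text}\n"
        "Label:"
    )

# Toy placeholder rows standing in for translated comments.
rows = [(f"offensive sample {i}", "offensive") for i in range(50)]
rows += [(f"neutral sample {i}", "non-offensive") for i in range(50)]

train, test = stratified_split(rows)
print(len(train), len(test))            # sizes of the two partitions
print(sum(label for _, label in test))  # offensive items in the test split
print(build_prompt(test[0][0]))
```

Splitting per class before shuffling is what keeps the offensive/non-offensive ratio identical in both partitions.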

Model Selection

A diverse set of models is selected for performance evaluation:

Large Language Models: LLaMA 3 (8B), LLaMA 2 (7B), Mistral (7B), fine-tuned using QLoRA
Traditional Transformers: RoBERTa and ModernBERT, fine-tuned using conventional supervised learning methods

QLoRA Fine-tuning Technique

Core Parameter Settings:

LoRA rank (r): 8
LoRA alpha: 32
LoRA dropout: 0.05
Adapted layers: q_proj and v_proj
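
To give a sense of how small the adapters are under these settings, the sketch below counts the LoRA weights added to q_proj and v_proj. The hyperparameters come from the summary and the dictionary keys mirror the field names of PEFT's `LoraConfig`; the projection shapes and layer count are assumptions about LLaMA 3 8B that the source does not state, and any trainable classification head is ignored.

```python
# LoRA hyperparameters reported in the summary; key names mirror PEFT's LoraConfig fields.
lora = {
    "r": 8,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "v_proj"],
}

# Assumed LLaMA 3 8B shapes (not stated in the source): hidden size 4096,
# 32 decoder layers, and a 1024-wide value projection from grouped-query attention.
NUM_LAYERS = 32
SHAPES = {"q_proj": (4096, 4096), "v_proj": (4096, 1024)}  # (in_features, out_features)

def lora_params(in_features, out_features, r):
    """A is r x in_features and B is out_features x r, so an adapter adds r * (in + out) weights."""
    return r * (in_features + out_features)

per_layer = sum(lora_params(*SHAPES[name], lora["r"]) for name in lora["target_modules"])
total = per_layer * NUM_LAYERS
scaling = lora["lora_alpha"] / lora["r"]  # factor applied to the low-rank update B @ A

print(f"adapter weights per layer: {per_layer:,}")
print(f"adapter weights in total:  {total:,}")
print(f"update scaling alpha/r:    {scaling}")
```

Under these assumed shapes the adapters amount to a few million weights, a tiny fraction of the roughly 8 billion frozen base parameters.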

Technical Advantages:

Achieves memory-efficient fine-tuning through low-rank adapters and quantized weights
Maintains performance while significantly reducing GPU memory consumption
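
The memory saving from quantized weights can be approximated with simple arithmetic. The sketch below compares 16-bit storage with the 4-bit storage that QLoRA uses for the frozen base model. The nominal parameter count is an assumption, and the estimate deliberately ignores activations, adapter weights, optimizer state, and quantization constants.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30

NUM_PARAMS = 8.0e9  # nominal size assumed for an "8B" model

fp16_gib = weight_memory_gib(NUM_PARAMS, 16)
four_bit_gib = weight_memory_gib(NUM_PARAMS, 4)

print(f"16-bit weights: {fp16_gib:.1f} GiB")
print(f"4-bit weights:  {four_bit_gib:.1f} GiB")
print(f"reduction:      {fp16_gib / four_bit_gib:.0f}x")
```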

Technical Innovations

Application of Quantized Low-Rank Adaptation: First application of QLoRA technology to Roman Urdu offensive language detection, enabling efficient fine-tuning of large models.
Translation-Assisted Cross-Lingual Transfer: Bridges language gaps through translation strategy, improving model understanding of underlying semantics.
Multi-Model Comparison Framework: Establishes a systematic comparative evaluation framework between LLMs and traditional Transformer models.

Experimental Setup

Dataset

Scale: 46,026 samples
Source: Facebook comments and YouTube replies
Annotation: Three bilingual annotators, Cohen's Kappa = 0.86
Split: 80% training, 20% testing (stratified sampling)
Preprocessing: Minimal cleaning to preserve contextual integrity
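
Cohen's kappa is defined for a pair of annotators, so with three annotators one common convention is to average the three pairwise values. The summary does not say how the reported 0.86 was aggregated, so the averaging below is an assumption, and the label vectors are made-up toy data rather than the paper's annotations.

```python
from collections import Counter
from itertools import combinations

def cohen_kappa(a, b):
    """Chance-corrected agreement between two annotators' label sequences."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Agreement expected by chance if both annotators labeled independently.
    expected = sum(count_a[k] * count_b[k] for k in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

def mean_pairwise_kappa(annotations):
    """Average Cohen's kappa over all annotator pairs (assumed aggregation)."""
    pairs = list(combinations(annotations, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)

# Toy labels for eight comments (1 = offensive, 0 = non-offensive).
ann1 = [1, 1, 0, 0, 1, 0, 1, 0]
ann2 = [1, 1, 0, 0, 1, 0, 0, 0]
ann3 = [1, 1, 0, 1, 1, 0, 1, 0]

print(cohen_kappa(ann1, ann2))
print(mean_pairwise_kappa([ann1, ann2, ann3]))
```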

Evaluation Metrics

Accuracy
Precision
Recall
F1 Score
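
For reference, the four metrics can be computed from predictions as in the sketch below. It treats "offensive" (label 1) as the positive class; the summary does not state whether the paper reports positive-class, macro, or weighted averages, so that choice is an assumption.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy example: 3 true positives, 1 false negative, 1 false positive, 3 true negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```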

Compared Models

LLaMA 3 (8B) + QLoRA
Mistral 7B + QLoRA
LLaMA 2 (7B) + QLoRA
RoBERTa (traditional fine-tuning)
ModernBERT (traditional fine-tuning)

Implementation Details

Hardware: NVIDIA A100 (80GB VRAM), 128GB RAM, 32-core CPU
Software Environment: Python 3.13.2, PyTorch, Transformers, PEFT, etc.
Hyperparameters: Learning rate 2e-5, batch size 2, training epochs 10, weight decay 0.01
Optimization Strategies: Gradient checkpointing, early stopping mechanism
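
The early stopping mentioned above is typically patience-based: training halts once the validation metric stops improving for a set number of epochs. The sketch below illustrates that logic under assumptions. The patience value and the validation curve are invented for illustration; only the 10-epoch cap comes from the summary.

```python
def run_with_early_stopping(val_f1_per_epoch, max_epochs=10, patience=2):
    """Return (best_epoch, best_f1, epochs_run) for a sequence of validation F1 scores."""
    best_f1, best_epoch, stale, epochs_run = float("-inf"), 0, 0, 0
    for epoch, f1 in enumerate(val_f1_per_epoch[:max_epochs], start=1):
        epochs_run = epoch
        if f1 > best_f1:
            # New best score: remember it and reset the patience counter.
            best_f1, best_epoch, stale = f1, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                break  # no improvement for `patience` consecutive epochs
    return best_epoch, best_f1, epochs_run

# Illustrative validation curve (not from the paper) that peaks at epoch 3.
curve = [0.86, 0.90, 0.914, 0.912, 0.910, 0.909]
result = run_with_early_stopping(curve)
print(result)
```

With a curve that peaks early, most of the 10-epoch budget is never used, which fits the report of peak validation F1 within 2-3 epochs.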

Experimental Results

Main Results

Model	Accuracy (%)	Precision (%)	Recall (%)	F1 Score (%)
LLaMA 3 (8B)	91.62	91.4	91.5	91.45
Mistral 7B	89.88	89.5	89.8	89.66
LLaMA 2 (7B)	88.74	88.2	88.6	88.4
RoBERTa	85.65	85.2	85.7	85.44
ModernBERT	83.92	83.1	84.0	83.55

Key Findings:

LLaMA 3 (8B) achieves the best performance with an F1 score of 91.45%
QLoRA-based large language models outperform the RoBERTa and ModernBERT baselines by roughly 3 to 8 F1 points
These gaps indicate the advantage of QLoRA fine-tuned LLMs on this task, as evaluated on English translations of the code-mixed text
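
As a sanity check on the table, the sketch below recomputes F1 as the harmonic mean of the reported precision and recall and measures the F1 gaps between the LLMs and the encoder baselines. All numbers are copied from the results table; the variable names are illustrative.

```python
# (precision, recall, reported F1) copied from the results table.
results = {
    "LLaMA 3 (8B)": (91.4, 91.5, 91.45),
    "Mistral 7B": (89.5, 89.8, 89.66),
    "LLaMA 2 (7B)": (88.2, 88.6, 88.4),
    "RoBERTa": (85.2, 85.7, 85.44),
    "ModernBERT": (83.1, 84.0, 83.55),
}
LLMS = ["LLaMA 3 (8B)", "Mistral 7B", "LLaMA 2 (7B)"]
ENCODERS = ["RoBERTa", "ModernBERT"]

def harmonic_f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

# Largest difference between a reported F1 and the value recomputed from P and R.
max_diff = max(abs(harmonic_f1(p, r) - f1) for p, r, f1 in results.values())

# Smallest and largest F1 gap between any LLM and any encoder baseline.
smallest_gap = min(results[m][2] for m in LLMS) - max(results[m][2] for m in ENCODERS)
largest_gap = max(results[m][2] for m in LLMS) - min(results[m][2] for m in ENCODERS)

print(f"max |recomputed - reported| F1: {max_diff:.3f}")
print(f"F1 gap range: {smallest_gap:.2f} to {largest_gap:.2f} points")
```

The recomputed values agree with the reported F1 scores to within 0.02 points, which is consistent with precision and recall being rounded to one decimal place.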

Training Behavior Analysis

Convergence Speed: The best-performing models reach peak validation F1 scores within 2-3 epochs
Training Stability: All models show smooth loss reduction with no signs of overfitting
Memory Efficiency: QLoRA significantly reduces memory requirements for large model fine-tuning

Inference Efficiency Comparison

LLaMA 3 (8B): Approximately 1.0 seconds/1000 samples
Mistral 7B: Approximately 0.80 seconds/1000 samples
LLaMA 2 (7B): Approximately 0.78 seconds/1000 samples
RoBERTa: Approximately 0.35 seconds/1000 samples
ModernBERT: Approximately 0.30 seconds/1000 samples

These figures reflect the trade-off between model size and inference speed.
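
The reported latencies can be converted into throughput and relative slowdown with a few lines of arithmetic. The figures are copied from the list above; the dictionary and variable names are illustrative.

```python
# Approximate seconds per 1000 samples, as reported above.
latency_s_per_1000 = {
    "LLaMA 3 (8B)": 1.0,
    "Mistral 7B": 0.80,
    "LLaMA 2 (7B)": 0.78,
    "RoBERTa": 0.35,
    "ModernBERT": 0.30,
}

# Samples processed per second for each model.
throughput = {name: 1000 / seconds for name, seconds in latency_s_per_1000.items()}

# How many times slower each model is than the fastest one.
fastest = min(latency_s_per_1000.values())
slowdown = {name: seconds / fastest for name, seconds in latency_s_per_1000.items()}

for name in latency_s_per_1000:
    print(f"{name}: {throughput[name]:.0f} samples/s, {slowdown[name]:.2f}x the fastest latency")
```

By these figures the most accurate model, LLaMA 3 (8B), takes a little over three times as long per sample as ModernBERT.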

Model Interpretability Analysis

Through LIME and SHAP analysis, the following patterns are identified:

High-Impact Offensive Vocabulary: "saalon", "naacho", "maaregi", etc.
Model Decision Patterns: LLaMA 3 focuses on contextual offensive language, while traditional models distribute weights more broadly
Bias Identification: Certain neutral words may mislead classification, highlighting the importance of data quality
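
The intuition behind word-level attributions can be shown with leave-one-out occlusion: remove one token at a time and record how much the offensive score drops. This is a simpler relative of LIME and SHAP, not either algorithm, and the scoring function below is a hypothetical toy lexicon scorer standing in for a fine-tuned model.

```python
def toy_offensive_score(tokens):
    """Hypothetical stand-in for a classifier: share of tokens found in a tiny lexicon."""
    lexicon = {"idiot", "stupid"}
    if not tokens:
        return 0.0
    return sum(token.lower() in lexicon for token in tokens) / len(tokens)

def occlusion_importance(text, score_fn):
    """Score drop caused by deleting each token; larger means more influential."""
    tokens = text.split()
    base = score_fn(tokens)
    importances = []
    for i, token in enumerate(tokens):
        without = tokens[:i] + tokens[i + 1:]
        importances.append((token, base - score_fn(without)))
    return importances

importances = occlusion_importance("you are an idiot", toy_offensive_score)
top_token = max(importances, key=lambda pair: pair[1])[0]
for token, delta in importances:
    print(f"{token}: {delta:+.3f}")
print("most influential token:", top_token)
```

Tokens whose removal lowers the score the most play the role of the high-impact vocabulary listed above, while neutral tokens with non-trivial attributions would point to the kind of bias the analysis mentions.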

Related Work

Offensive Language Detection Research

Traditional Methods: Machine learning approaches based on hand-crafted features (SVM, Naive Bayes, etc.)
Deep Learning Methods: CNN, RNN, and Transformer architectures (BERT and its variants)
Multilingual Processing: Cross-lingual transfer learning and zero-shot learning methods

Low-Resource Language Processing

Roman Urdu Research: Limited research constructing Roman Urdu datasets and embedding methods
Code-Mixed Processing: Multilingual embeddings and machine translation-assisted methods
Resource Scarcity Challenges: Lack of pre-trained models and large-scale annotated corpora

Large Language Model Fine-tuning

Parameter-Efficient Fine-tuning: Development of QLoRA, LoRA, and related techniques
LLM Applications: Application of GPT, LLaMA, and Mistral to text classification tasks
Quantization Techniques: Reducing computational resource requirements while maintaining performance

Conclusions and Discussion

Main Conclusions

Effectiveness of QLoRA Fine-tuning: QLoRA fine-tuned large language models significantly outperform traditional methods on Roman Urdu-English code-mixed offensive language detection tasks.
Feasibility of Translation Strategy: Translation preprocessing effectively enables the use of English LLMs for low-resource code-mixed language processing.
Importance of Model Scale: Larger models show a clear advantage on this task, with the 7B-8B LLMs outperforming the smaller encoder models.

Limitations

Loss of Code-Mixed Features: The translation process results in loss of original code-switching structures; models actually process English translations rather than native code-mixed text.
Computational Resource Requirements: High inference latency of large language models may limit real-time applications.
Dataset Scale: Relatively small dataset size may impact model generalization capability.
Translation Quality Dependency: Method effectiveness is highly dependent on Google Translate quality.

Future Directions

Direct Code-Mixed Text Processing: Develop LLMs capable of directly processing Roman Urdu without translation.
Zero-Shot and Few-Shot Learning: Reduce dependence on annotated data.
Cross-Lingual Transfer Optimization: Improve cross-lingual transfer methods to better preserve code-mixed characteristics.
Real-Time Optimization: Optimize inference speed for practical deployment requirements.

In-Depth Evaluation

Strengths

Methodological Innovation: First application of QLoRA technology to Roman Urdu offensive language detection, providing novel solutions.
Experimental Comprehensiveness: Compares multiple models of different scales and architectures, providing comprehensive performance benchmarks.
Practical Value: Provides feasible technical solutions for social media content moderation.
Technical Advancement: Employs cutting-edge parameter-efficient fine-tuning techniques, achieving good performance in resource-constrained environments.

Weaknesses

Method Limitations: While practical, the translation preprocessing strategy loses the essential characteristics of code-mixing.
Dataset Constraints: Relatively small dataset sourced from specific platforms may impact generalization.
Evaluation Dimensions: Lacks fine-grained analysis of different types of offensive language.
Theoretical Contribution: Primarily engineering implementation with limited theoretical innovation.

Impact

Academic Contribution: Provides effective methods for offensive content detection in low-resource code-mixed languages.
Practical Application: Directly applicable to Roman Urdu social media content moderation.
Technology Adoption: Demonstrates the potential of QLoRA for domain-specific tasks.
Research Inspiration: Provides reference framework for similar tasks in other low-resource languages.

Applicable Scenarios

Social Media Platforms: Roman Urdu content moderation on Facebook, Twitter, and similar platforms.
Online Community Management: Online forums and communities in Pakistan and India.
Educational Applications: Cyberbullying detection and prevention systems.
Research Foundation: Development basis for multilingual offensive language detection systems.

References

The paper cites 46 relevant references covering multiple domains including offensive language detection, large language models, and code-mixed language processing, providing solid theoretical foundation and technical support for the research.

Overall Assessment: This paper demonstrates mature technical implementation with reasonable experimental design and convincing results. While relatively limited in theoretical innovation, it provides valuable solutions for practical applications in low-resource code-mixed languages, with good practical value and potential for wider adoption.