The use of derogatory terms in languages that employ code mixing, such as Roman Urdu, presents challenges for Natural Language Processing systems due to non-standard grammar, inconsistent spelling, and a scarcity of labeled data. In this work, we propose a QLoRA-based fine-tuning framework to improve offensive language detection in Roman Urdu-English text. We translated the Roman Urdu-English code-mixed dataset into English using Google Translate to leverage English LLMs, while acknowledging that this translation reduces direct engagement with code-mixing features. Our focus is on classification performance using English-translated low-resource inputs. We fine-tuned several transformers and large language models, including Meta LLaMA 3 8B, Mistral 7B v0.1, LLaMA 2 7B, ModernBERT, and RoBERTa, with QLoRA for memory-efficient adaptation. Models were trained and evaluated on a manually annotated Roman Urdu dataset for offensive vs. non-offensive content. Of all tested models, the highest F1 score of 91.45 was attained by Meta LLaMA 3 8B, followed by Mistral 7B at 89.66, surpassing traditional transformer baselines. These results demonstrate the efficacy of QLoRA in fine-tuning high-performing models for low-resource environments such as code-mixed offensive language detection, and confirm the potential of LLMs for this task. This work advances a scalable approach to Roman Urdu moderation and paves the way for future multilingual offensive detection systems based on LLMs.
- Paper ID: 2510.03683
- Title: Fine-Tuning Large Language Models with QLoRA for Offensive Language Detection in Roman Urdu-English Code-Mixed Text
- Authors: Nisar Hussain, Amna Qasim, Gull Mehak, Muhammad Usman, Muhammad Zain, Momina Hafeez, Grigori Sidorov
- Institution: Instituto Politécnico Nacional (IPN), Centro de Investigación en Computación (CIC), Mexico
- Classification: cs.CL (Computational Linguistics)
- Paper Link: https://arxiv.org/abs/2510.03683
This study addresses offensive language detection in Roman Urdu-English code-mixed text by proposing a fine-tuning framework for large language models using QLoRA. Given the challenges of Roman Urdu including grammatical irregularities, spelling inconsistencies, and scarcity of annotated data, the researchers employ Google Translate to convert code-mixed text into English to fully leverage the capabilities of English-based large language models. Experiments are conducted on multiple models including Meta-LLaMA-3-8B, Mistral-7B-v0.1, LLaMA 2-7B, ModernBERT, and RoBERTa. Results demonstrate that Meta-LLaMA-3-8B achieves the highest F1 score of 91.45%, while Mistral-7B reaches 89.66%, both surpassing traditional Transformer baseline models.
The core problem addressed in this research is offensive language detection in Roman Urdu-English code-mixed text. Roman Urdu is the primary form of digital communication in Pakistan and parts of India, where users write Urdu using Latin characters and frequently mix English vocabulary.
- Social Media Safety Requirements: With the proliferation of platforms such as Twitter, Facebook, and YouTube, the dissemination of offensive and harmful content has become increasingly severe. Identifying and reducing such content is crucial for maintaining digital health and preventing psychological harm to users.
- Unique Challenges of Code-Mixed Languages: Roman Urdu-English code-mixed text exhibits non-standard grammar, spelling inconsistencies, and lack of annotated datasets. These characteristics significantly reduce the accuracy of traditional NLP models.
- Traditional Machine Learning Approaches: Early methods using SVM, Naive Bayes, and logistic regression combined with TF-IDF or n-gram features demonstrate poor generalization across different contexts and languages, particularly on informal, noisy, or code-mixed data.
- Deep Learning Models: While CNNs and RNNs outperform traditional methods in capturing contextual information, they still struggle with morphologically rich, low-resource languages such as Roman Urdu.
- Scarcity of Pre-trained Models: The absence of specialized pre-trained models or large-scale annotated corpora for Roman Urdu limits the application of existing methods.
- Proposes an end-to-end Roman Urdu-English offensive language detection pipeline: Constructs a complete processing workflow from data preprocessing to model evaluation.
- Applies QLoRA to LLaMA and Mistral models: First application of Quantized Low-Rank Adaptation technology to Roman Urdu offensive language detection tasks.
- Conducts comprehensive comparative evaluation: Compares the performance of QLoRA fine-tuned large language models with traditionally fine-tuned ModernBERT and RoBERTa models.
- Adopts translation-based preprocessing strategy: Leverages English large language models to process low-resource code-mixed text through translation methods.
Input: Roman Urdu-English code-mixed text
Output: Binary classification labels (offensive/non-offensive)
Constraints: Handling low-resource, non-standard grammar, and code-mixed characteristics
The research employs a systematic processing pipeline:
- Data Collection and Preprocessing
  - Dataset contains 46,026 samples (24,026 "offensive", 22,000 "non-offensive")
  - Primarily scraped from public Facebook comments and YouTube replies
  - Manually annotated by three bilingual annotators with Cohen's Kappa agreement of 0.86
- Translation Processing
  - Uses GoogleTranslator library from the deep_translator package
  - Translates Roman Urdu text to English to leverage English LLMs
  - Preserves original code-mixed characteristics until the translation stage
- Dataset Partitioning and Annotation
  - Label mapping: "offensive" → 1, "non-offensive" → 0
  - Uses stratified sampling for 80% training and 20% test split
  - For decoder models, input is formatted in prompt style
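The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names, the random seed, and the prompt wording are all assumptions, since the summary does not give the exact prompt template or split implementation. The translation helper accepts any callable so it can be exercised offline; its default path uses `GoogleTranslator` from `deep_translator`, which requires the package and network access.

```python
import random
from collections import defaultdict

# Label mapping as described in the pipeline.
LABEL_MAP = {"offensive": 1, "non-offensive": 0}


def translate_to_english(text, translator=None):
    """Translate one comment to English.

    If no translator callable is supplied, fall back to deep_translator's
    GoogleTranslator (needs the package installed and network access).
    """
    if translator is None:
        from deep_translator import GoogleTranslator
        translator = GoogleTranslator(source="auto", target="en").translate
    return translator(text)


def stratified_split(rows, test_fraction=0.2, seed=42):
    """Split (text, label) rows so each label keeps the same train/test ratio.

    The seed is an arbitrary choice made for this sketch.
    """
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[1]].append(row)
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


def build_prompt(text):
    """Hypothetical prompt template for decoder-only models."""
    return (
        "Classify the following comment as offensive or non-offensive.\n"
        f"Comment: {text}\n"
        "Label:"
    )
```

Splitting per label rather than over the whole dataset is what keeps the offensive/non-offensive ratio the same in both partitions.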
A diverse set of models is selected for performance evaluation:
- Large Language Models: LLaMA 3 (8B), LLaMA 2 (7B), Mistral (7B), fine-tuned using QLoRA
- Traditional Transformers: RoBERTa and ModernBERT, fine-tuned using conventional supervised learning methods
Core Parameter Settings:
- rank (r=8)
- alpha (32)
- dropout (0.05)
- Adaptation layers: q_proj and v_proj
Technical Advantages:
- Achieves memory-efficient fine-tuning through low-rank adapters and quantized weights
- Maintains performance while significantly reducing GPU memory consumption
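To give a sense of why these settings are memory-efficient, the sketch below counts the trainable adapter parameters implied by r=8 on `q_proj` and `v_proj`. The layer shapes are assumptions taken from the publicly documented LLaMA 3 8B configuration (hidden size 4096, 32 layers, 8 key/value heads of 128 dimensions), not from the paper, and the count covers only the LoRA matrices, not any classification head.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable parameters of one LoRA adapter.

    A has shape (rank x d_in) and B has shape (d_out x rank),
    so the adapter adds rank * (d_in + d_out) parameters.
    """
    return rank * (d_in + d_out)


RANK, ALPHA = 8, 32  # values from the parameter settings above

# Assumed LLaMA 3 8B shapes (public model config, not stated in the paper).
HIDDEN = 4096    # model width; q_proj maps 4096 -> 4096
KV_DIM = 1024    # 8 key/value heads x 128 dims; v_proj maps 4096 -> 1024
N_LAYERS = 32

per_layer = (
    lora_param_count(HIDDEN, HIDDEN, RANK)    # q_proj adapter
    + lora_param_count(HIDDEN, KV_DIM, RANK)  # v_proj adapter
)
total = per_layer * N_LAYERS
scaling = ALPHA / RANK  # LoRA scales the low-rank update by alpha / r

print(f"LoRA scaling factor alpha/r: {scaling}")
print(f"Trainable adapter parameters: {total:,}")
```

Under these assumptions the adapters amount to a few million parameters against roughly 8 billion frozen, quantized base weights, which is the source of the memory savings.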
- Application of Quantized Low-Rank Adaptation: First application of QLoRA technology to Roman Urdu offensive language detection, enabling efficient fine-tuning of large models.
- Translation-Assisted Cross-Lingual Transfer: Bridges language gaps through translation strategy, improving model understanding of underlying semantics.
- Multi-Model Comparison Framework: Establishes a systematic comparative evaluation framework between LLMs and traditional Transformer models.
- Scale: 46,026 samples
- Source: Facebook comments and YouTube replies
- Annotation: Three bilingual annotators, Cohen's Kappa = 0.86
- Split: 80% training, 20% testing (stratified sampling)
- Preprocessing: Minimal cleaning to preserve contextual integrity
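Cohen's Kappa is defined for a pair of annotators, and the summary does not say how the 0.86 figure was obtained for three. One common option, shown here purely as an assumption, is to average Cohen's Kappa over all annotator pairs. The function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's Kappa between two annotators' label sequences."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    freq_a, freq_b = Counter(a), Counter(b)
    # Chance agreement: product of each annotator's marginal label rates.
    expected = sum(freq_a[c] * freq_b[c] for c in set(a) | set(b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)


def mean_pairwise_kappa(annotations):
    """Average Cohen's Kappa over every pair of annotators (an assumption)."""
    pairs = list(combinations(annotations, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)
```

Fleiss' Kappa is another standard choice for more than two annotators and may give a slightly different value on the same labels.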
- Accuracy
- Precision
- Recall
- F1 Score
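The four metrics above follow directly from the confusion counts. The sketch below computes them for the positive (offensive) class; whether the paper reports positive-class, macro, or weighted averages is not stated in this summary, so the averaging choice here is an assumption.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for the positive (offensive) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

For example, `binary_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1])` has two true positives, one false positive, and one false negative, so all four metrics come out to 2/3.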
- LLaMA 3 (8B) + QLoRA
- Mistral 7B + QLoRA
- LLaMA 2 (7B) + QLoRA
- RoBERTa (traditional fine-tuning)
- ModernBERT (traditional fine-tuning)
- Hardware: NVIDIA A100 (80GB VRAM), 128GB RAM, 32-core CPU
- Software Environment: Python 3.13.2, PyTorch, Transformers, PEFT, etc.
- Hyperparameters: Learning rate 2e-5, batch size 2, training epochs 10, weight decay 0.01
- Optimization Strategies: Gradient checkpointing, early stopping mechanism
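As a rough illustration of how the listed hyperparameters and early stopping might fit together, consider the sketch below. The summary gives no patience value or monitored metric, so `patience=2` and the use of validation F1 are assumptions, and the per-epoch scores in the loop are made-up placeholder numbers, not results from the paper.

```python
# Hyperparameters as listed in the setup above.
HPARAMS = {
    "learning_rate": 2e-5,
    "batch_size": 2,
    "max_epochs": 10,
    "weight_decay": 0.01,
}


class EarlyStopping:
    """Stop when the monitored score has not improved for `patience` epochs."""

    def __init__(self, patience=2, min_delta=0.0):
        self.patience = patience      # hypothetical value, not from the paper
        self.min_delta = min_delta
        self.best = None
        self.bad_epochs = 0

    def step(self, score):
        """Record one epoch's score; return True if training should stop."""
        if self.best is None or score > self.best + self.min_delta:
            self.best = score
            self.bad_epochs = 0
            return False
        self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Placeholder validation F1 values, for illustration only.
fake_val_f1 = [0.80, 0.85, 0.84, 0.83, 0.90]
stopper = EarlyStopping(patience=2)
stopped_at = None
for epoch, score in enumerate(fake_val_f1[: HPARAMS["max_epochs"]], start=1):
    if stopper.step(score):
        stopped_at = epoch
        break
print(f"Stopped at epoch {stopped_at}, best score {stopper.best}")
```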
| Model | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|
| LLaMA 3 (8B) | 91.62 | 91.4 | 91.5 | 91.45 |
| Mistral 7B | 89.88 | 89.5 | 89.8 | 89.66 |
| LLaMA 2 (7B) | 88.74 | 88.2 | 88.6 | 88.4 |
| RoBERTa | 85.65 | 85.2 | 85.7 | 85.44 |
| ModernBERT | 83.92 | 83.1 | 84.0 | 83.55 |
Key Findings:
- LLaMA 3 (8B) achieves the best performance with an F1 score of 91.45%
- QLoRA-based large language models outperform the conventionally fine-tuned Transformer models by roughly 3 to 8 F1 points (88.4 to 91.45 versus 83.55 to 85.44)
- Performance gaps demonstrate the advantages of QLoRA fine-tuning on code-mixed language tasks
- Convergence Speed: The best-performing models reach peak validation F1 scores within 2-3 epochs
- Training Stability: All models show smooth loss reduction with no signs of overfitting
- Memory Efficiency: QLoRA significantly reduces memory requirements for large model fine-tuning
- LLaMA 3 (8B): Approximately 1.0 seconds/1000 samples
- Mistral 7B: Approximately 0.80 seconds/1000 samples
- LLaMA 2 (7B): Approximately 0.78 seconds/1000 samples
- RoBERTa: Approximately 0.35 seconds/1000 samples
- ModernBERT: Approximately 0.30 seconds/1000 samples
These timings reflect the trade-off between model size and inference speed.
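The reported timings can be converted into throughput and relative cost with simple arithmetic. The sketch below uses only the figures listed above; the helper names are hypothetical.

```python
# Reported inference time in seconds per 1000 samples.
SECONDS_PER_1000 = {
    "LLaMA 3 (8B)": 1.0,
    "Mistral 7B": 0.80,
    "LLaMA 2 (7B)": 0.78,
    "RoBERTa": 0.35,
    "ModernBERT": 0.30,
}

FASTEST = min(SECONDS_PER_1000.values())


def throughput(model):
    """Samples processed per second."""
    return 1000 / SECONDS_PER_1000[model]


def relative_cost(model):
    """Inference time relative to the fastest model in the table."""
    return SECONDS_PER_1000[model] / FASTEST


for name in SECONDS_PER_1000:
    print(f"{name}: {throughput(name):.0f} samples/s, "
          f"{relative_cost(name):.2f}x the fastest model's time")
```

On these figures, LLaMA 3 (8B) takes about 3.3 times as long per sample as ModernBERT in exchange for roughly 8 additional F1 points.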
Through LIME and SHAP analysis, the following patterns are identified:
- High-Impact Offensive Vocabulary: "saalon", "naacho", "maaregi", etc.
- Model Decision Patterns: LLaMA 3 focuses on contextual offensive language, while traditional models distribute weights more broadly
- Bias Identification: Certain neutral words may mislead classification, highlighting the importance of data quality
- Traditional Methods: Machine learning approaches based on hand-crafted features (SVM, Naive Bayes, etc.)
- Deep Learning Methods: CNN, RNN, and Transformer architectures (BERT and its variants)
- Multilingual Processing: Cross-lingual transfer learning and zero-shot learning methods
- Roman Urdu Research: Limited research constructing Roman Urdu datasets and embedding methods
- Code-Mixed Processing: Multilingual embeddings and machine translation-assisted methods
- Resource Scarcity Challenges: Lack of pre-trained models and large-scale annotated corpora
- Parameter-Efficient Fine-tuning: Development of QLoRA, LoRA, and related techniques
- LLM Applications: Application of GPT, LLaMA, and Mistral to text classification tasks
- Quantization Techniques: Reducing computational resource requirements while maintaining performance
- Effectiveness of QLoRA Fine-tuning: QLoRA fine-tuned large language models significantly outperform traditional methods on Roman Urdu-English code-mixed offensive language detection tasks.
- Feasibility of Translation Strategy: Translation preprocessing effectively enables the use of English LLMs for low-resource code-mixed language processing.
- Importance of Model Scale: Models with more parameters show clear advantages on complex NLP tasks.
- Loss of Code-Mixed Features: The translation process results in loss of original code-switching structures; models actually process English translations rather than native code-mixed text.
- Computational Resource Requirements: High inference latency of large language models may limit real-time applications.
- Dataset Scale: Relatively small dataset size may impact model generalization capability.
- Translation Quality Dependency: Method effectiveness is highly dependent on Google Translate quality.
- Direct Code-Mixed Text Processing: Develop LLMs capable of directly processing Roman Urdu without translation.
- Zero-Shot and Few-Shot Learning: Reduce dependence on annotated data.
- Cross-Lingual Transfer Optimization: Improve cross-lingual transfer methods to better preserve code-mixed characteristics.
- Real-Time Optimization: Optimize inference speed for practical deployment requirements.
- Methodological Innovation: First application of QLoRA technology to Roman Urdu offensive language detection, providing novel solutions.
- Experimental Comprehensiveness: Compares multiple models of different scales and architectures, providing comprehensive performance benchmarks.
- Practical Value: Provides feasible technical solutions for social media content moderation.
- Technical Advancement: Employs cutting-edge parameter-efficient fine-tuning techniques, achieving good performance in resource-constrained environments.
- Method Limitations: While practical, the translation preprocessing strategy loses the essential characteristics of code-mixing.
- Dataset Constraints: Relatively small dataset sourced from specific platforms may impact generalization.
- Evaluation Dimensions: Lacks fine-grained analysis of different types of offensive language.
- Theoretical Contribution: Primarily engineering implementation with limited theoretical innovation.
- Academic Contribution: Provides effective methods for offensive content detection in low-resource code-mixed languages.
- Practical Application: Directly applicable to Roman Urdu social media content moderation.
- Technology Promotion: Demonstrates the application potential of QLoRA in domain-specific tasks.
- Research Inspiration: Provides reference framework for similar tasks in other low-resource languages.
- Social Media Platforms: Roman Urdu content moderation on Facebook, Twitter, and similar platforms.
- Online Community Management: Online forums and communities in Pakistan and India.
- Educational Applications: Cyberbullying detection and prevention systems.
- Research Foundation: Development basis for multilingual offensive language detection systems.
The paper cites 46 relevant references covering multiple domains including offensive language detection, large language models, and code-mixed language processing, providing solid theoretical foundation and technical support for the research.
Overall Assessment: This paper demonstrates a mature technical implementation with a reasonable experimental design and convincing results. While relatively limited in theoretical innovation, it provides valuable solutions for practical applications in low-resource code-mixed languages, with good practical value and potential for wider adoption.