Fine-Tuning Large Language Models with QLoRA for Offensive Language Detection in Roman Urdu-English Code-Mixed Text

Hussain, Qasim, Mehak et al.

Derogatory language in code-mixed text such as Roman Urdu-English is hard for Natural Language Processing systems to handle because of non-standard grammar, inconsistent spelling, and a scarcity of labeled data. In this work, we propose a QLoRA-based fine-tuning framework to improve offensive language detection in Roman Urdu-English text. We translated the Roman Urdu-English code-mixed dataset into English using Google Translate to leverage English LLMs, while acknowledging that this translation reduces direct engagement with code-mixing features. Our focus is on classification performance using English-translated low-resource inputs. We fine-tuned several transformers and large language models, including Meta LLaMA 3 8B, Mistral 7B v0.1, LLaMA 2 7B, ModernBERT, and RoBERTa, with QLoRA for memory-efficient adaptation. Models were trained and evaluated on a manually annotated Roman Urdu dataset for offensive vs. non-offensive content. Of all tested models, Meta LLaMA 3 8B attained the highest F1 score, 91.45, followed by Mistral 7B at 89.66, both surpassing traditional transformer baselines. These results demonstrate the efficacy of QLoRA for fine-tuning high-performing models in low-resource settings such as code-mixed offensive language detection, and confirm the potential of LLMs for this task. This work advances a scalable approach to Roman Urdu moderation and paves the way for future multilingual offensive language detection systems based on LLMs.

Basic Information

Paper ID: 2510.03683
Title: Fine-Tuning Large Language Models with QLoRA for Offensive Language Detection in Roman Urdu-English Code-Mixed Text
Authors: Nisar Hussain, Amna Qasim, Gull Mehak, Muhammad Usman, Muhammad Zain, Momina Hafeez, Grigori Sidorov
Affiliation: Instituto Politécnico Nacional (IPN), Centro de Investigación en Computación (CIC), Mexico
Category: cs.CL (Computation and Language)
Paper link: https://arxiv.org/abs/2510.03683

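Summary

This study addresses offensive language detection in Roman Urdu-English code-mixed text and proposes a QLoRA-based fine-tuning framework for large language models. Because Roman Urdu poses challenges such as non-standard grammar, inconsistent spelling, and scarce annotated data, the authors translate the code-mixed text into English with Google Translate so that English-language LLMs can be used. Experiments cover Meta-LLaMA-3-8B, Mistral-7B-v0.1, LLaMA 2-7B, ModernBERT, and RoBERTa. Meta-LLaMA-3-8B achieves the highest F1 score, 91.45%, and Mistral-7B reaches 89.66%; both outperform the traditional Transformer baselines.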

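Social media safety: As platforms such as Twitter, Facebook, and YouTube have grown, the spread of offensive and harmful content has become more severe. Identifying and reducing such content is essential to digital well-being and to protecting users from psychological harm.
Challenges specific to code-mixed language: Roman Urdu-English code-mixed text has non-standard grammar, inconsistent spelling, and few annotated datasets, all of which significantly reduce the accuracy of conventional NLP models.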

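Limitations of Existing Methods

Traditional machine learning: Early work combined SVM, Naive Bayes, and logistic regression with TF-IDF or n-gram features, but these methods generalize poorly across contexts and languages and perform badly on informal, noisy, or code-mixed data.
Deep learning models: CNNs and RNNs capture contextual information better than traditional methods but still struggle with morphologically rich, low-resource languages such as Roman Urdu.
Scarcity of pretrained models: Roman Urdu lacks dedicated pretrained models and large annotated corpora, which limits the applicability of existing methods.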

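Core Contributions

An end-to-end pipeline for Roman Urdu-English offensive language detection: a complete workflow from data preprocessing through model evaluation.
QLoRA applied to LLaMA and Mistral models: reported as the first application of quantized low-rank adaptation to Roman Urdu offensive language detection.
A comprehensive comparative evaluation: QLoRA-fine-tuned LLMs are compared against conventionally fine-tuned ModernBERT and RoBERTa models.
A translation-based preprocessing strategy: translation lets English LLMs process low-resource code-mixed text.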

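Data Collection and Preprocessing
- The dataset contains 46,026 samples (24,026 "offensive" and 22,000 "non-offensive")
- Scraped mainly from public Facebook comments and YouTube replies
- Manually labeled by three bilingual annotators, with a Cohen's Kappa agreement of 0.86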
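
The agreement figure above can be made concrete with a small sketch of Cohen's kappa. Cohen's kappa is defined for two annotators; how the three annotators' labels were combined into a single 0.86 is not stated here, so the pairwise average below is an assumption, and the toy labels are illustrative rather than taken from the dataset.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_observed - p_expected) / (1.0 - p_expected)


def mean_pairwise_kappa(annotations):
    """Average kappa over all annotator pairs (ASSUMED aggregation)."""
    pairs = list(combinations(annotations, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)


# Toy labels (1 = offensive, 0 = non-offensive); not the paper's data.
annotator_1 = [1, 1, 0, 0, 1, 0]
annotator_2 = [1, 1, 0, 0, 0, 0]
annotator_3 = [1, 1, 0, 0, 1, 0]
kappa_12 = cohen_kappa(annotator_1, annotator_2)
kappa_mean = mean_pairwise_kappa([annotator_1, annotator_2, annotator_3])
print(f"kappa(1,2) = {kappa_12:.3f}, mean pairwise = {kappa_mean:.3f}")
```

Values above roughly 0.8 are conventionally read as strong agreement, which is the range the reported 0.86 falls in.
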
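Translation
- Uses the GoogleTranslator class from the deep_translator package
- Translates the Roman Urdu text into English so that English LLMs can be used
- The original code-mixed form is preserved up to the translation stage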
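
A minimal sketch of how such a translation pass might be organized is shown below. The helper names (`make_google_translate_fn`, `translate_corpus`), the `source="auto"` setting, the caching, and the fall-back-to-original behavior are all assumptions for illustration; only the use of `GoogleTranslator` from `deep_translator` comes from the text. The demonstration runs offline with a stand-in translator.

```python
def make_google_translate_fn(target="en"):
    """Return a text -> text function backed by deep_translator.

    Imported lazily so the rest of the sketch runs without the package
    or network access. source="auto" is an assumption.
    """
    from deep_translator import GoogleTranslator
    return GoogleTranslator(source="auto", target=target).translate


def translate_corpus(texts, translate_fn):
    """Translate each text once, caching repeats and keeping failures."""
    cache = {}
    translated = []
    for text in texts:
        key = text.strip()
        if not key:
            translated.append(text)  # nothing to translate
            continue
        if key not in cache:
            try:
                result = translate_fn(key)
            except Exception:
                result = None  # e.g. a network or quota error
            # Fall back to the original text if translation failed.
            cache[key] = result if result else key
        translated.append(cache[key])
    return translated


# Offline demonstration with a stand-in translator (hypothetical lookup).
stub_lookup = {"tum bohat ache ho": "you are very nice"}
stub_fn = lambda text: stub_lookup.get(text)
demo = translate_corpus(["tum bohat ache ho", "", "nice video"], stub_fn)
print(demo)
```

For a real run, `translate_corpus(texts, make_google_translate_fn())` would send each distinct comment to the translation service once.
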
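Dataset Splitting and Labeling
- Label mapping: "offensive" → 1, "non-offensive" → 0
- Stratified sampling into 80% training and 20% test
- For decoder models, inputs are formatted as prompts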
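
The three steps above can be sketched in plain Python as follows. The prompt wording, the random seed, and the per-class rounding rule are assumptions (the exact prompt and split sizes are not given here), and the data is a synthetic stand-in that only reproduces the reported class counts.

```python
import random
from collections import defaultdict

LABEL_TO_ID = {"offensive": 1, "non-offensive": 0}

# Hypothetical prompt template; the exact wording used is not given.
PROMPT_TEMPLATE = (
    "Classify the following comment as offensive or non-offensive.\n"
    "Comment: {text}\n"
    "Label:"
)


def stratified_split(examples, test_fraction=0.2, seed=42):
    """Split (text, label) pairs so each label keeps its share in both parts."""
    by_label = defaultdict(list)
    for example in examples:
        by_label[example[1]].append(example)
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test


def build_prompt(text):
    """Format one translated comment as a decoder-model prompt."""
    return PROMPT_TEMPLATE.format(text=text.strip())


# Synthetic stand-in with the reported class counts (texts are placeholders).
data = [(f"comment {i}", "offensive") for i in range(24026)]
data += [(f"comment {i}", "non-offensive") for i in range(22000)]
train, test = stratified_split(data)
train_ids = [LABEL_TO_ID[label] for _, label in train]
test_ids = [LABEL_TO_ID[label] for _, label in test]
print(len(train), len(test), sum(test_ids) / len(test_ids))
```

Splitting each class separately keeps the offensive share at about 52% in both partitions; other tools round the test size differently, so exact counts may differ by a sample or two.
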

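Model Selection

A diverse set of models was selected for evaluation:

Large language models: LLaMA 3 (8B), LLaMA 2 (7B), and Mistral (7B), fine-tuned with QLoRA
Traditional Transformers: RoBERTa and ModernBERT, fine-tuned with conventional supervised learning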

QLoRA Fine-Tuning

Core parameter settings:

Rank: r = 8
Alpha: 32
Dropout: 0.05
Adapted layers: q_proj and v_proj
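
To give a sense of scale for these settings, the sketch below counts the adapter weights they would add. The dictionary keys mirror the keyword names of `peft.LoraConfig`; the layer count and projection shapes are assumptions about an 8B LLaMA-3-style decoder and are not stated in this summary. The count excludes any classification head.

```python
# Keys mirror the keyword names of peft.LoraConfig; values are from the text.
lora_settings = {
    "r": 8,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "v_proj"],
}

# ASSUMED architecture numbers for an 8B LLaMA-3-style decoder
# (not stated in this summary): 32 layers, hidden size 4096, and
# grouped-query attention with a 1024-wide value projection.
ASSUMED_LAYERS = 32
ASSUMED_PROJ_SHAPES = {
    "q_proj": (4096, 4096),  # (in_features, out_features)
    "v_proj": (4096, 1024),
}


def lora_param_count(rank, shapes, num_layers):
    """Count adapter weights: each adapted matrix adds A (r x in) and B (out x r)."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in shapes.values())
    return per_layer * num_layers


shapes = {name: ASSUMED_PROJ_SHAPES[name] for name in lora_settings["target_modules"]}
trainable = lora_param_count(lora_settings["r"], shapes, ASSUMED_LAYERS)
scaling = lora_settings["lora_alpha"] / lora_settings["r"]
print(f"adapter parameters: {trainable:,}")
print(f"LoRA scaling alpha/r: {scaling}")
```

Under these assumptions the adapters hold about 3.4 million weights, a small fraction of a percent of an 8-billion-parameter model.
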

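Technical advantages:

Memory-efficient fine-tuning through low-rank adapters and quantized weights
Significantly lower GPU memory use while maintaining performance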
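
Both advantages follow from the shape of an adapted layer. As a sketch in the standard QLoRA formulation (the symbols are introduced here for illustration and are not taken from the summary), the frozen base weight $W_0$ is stored in 4 bits and dequantized on the fly, while only the small matrices $A$ and $B$ are trained:

```latex
h = \operatorname{dequant}\!\left(W_0^{\text{4-bit}}\right) x + \frac{\alpha}{r}\, B A x,
\qquad A \in \mathbb{R}^{r \times d_{\text{in}}},\quad
B \in \mathbb{R}^{d_{\text{out}} \times r}

\text{weight memory} \approx N \cdot \frac{b}{8}\ \text{bytes}, \qquad
N = 8 \times 10^{9}:\quad
b = 16 \Rightarrow \approx 16\ \text{GB}, \qquad
b = 4 \Rightarrow \approx 4\ \text{GB}
```

This estimate covers base weights only; quantization constants, activations, adapter gradients, and optimizer state add to the total.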

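Technical Innovations

Application of quantized low-rank adaptation: reported as the first use of QLoRA for Roman Urdu offensive language detection, enabling efficient fine-tuning of large models.
Translation-assisted cross-lingual transfer: translation bridges the language gap and improves the model's grasp of the underlying semantics.
Multi-model comparison framework: a systematic comparative evaluation of LLMs and traditional Transformer models.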

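Experimental Setup

Dataset

Size: 46,026 samples
Sources: Facebook comments and YouTube replies
Annotation: three bilingual annotators, Cohen's Kappa = 0.86
Split: 80% training, 20% test (stratified sampling)
Preprocessing: minimal cleaning to preserve contextual integrity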

Evaluation Metrics

Accuracy
Precision
Recall
F1 Score
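
For reference, the four metrics can be computed from predictions as in the sketch below. It treats "offensive" (label 1) as the positive class; whether the reported scores use this binary convention or a macro or weighted average is not stated here, so that choice is an assumption. The labels are toy values.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for the positive class."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty label lists of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    accuracy = (tp + tn) / len(pairs)
    # Guard against empty denominators when a class is never predicted.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy example: 1 = offensive, 0 = non-offensive.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
scores = binary_metrics(y_true, y_pred)
print(scores)
```
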

Compared Methods

LLaMA 3 (8B) + QLoRA
Mistral 7B + QLoRA
LLaMA 2 (7B) + QLoRA
RoBERTa (conventional fine-tuning)
ModernBERT (conventional fine-tuning)

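Implementation Details

Hardware: NVIDIA A100 (80GB VRAM), 128GB RAM, 32-core CPU
Software: Python 3.13.2, PyTorch, Transformers, PEFT, and others
Hyperparameters: learning rate 2e-5, batch size 2, 10 training epochs, weight decay 0.01
Optimization: gradient checkpointing, early stopping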
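
The sketch below works out what these hyperparameters imply for training length and shows one common form of early stopping. The step count assumes a single device with no gradient accumulation, and the `EarlyStopping` class, its patience of 2, and the validation losses are hypothetical; the summary does not say which criterion or patience was used.

```python
import math

hyperparams = {  # values from the text
    "learning_rate": 2e-5,
    "batch_size": 2,
    "num_epochs": 10,
    "weight_decay": 0.01,
}

TOTAL_SAMPLES = 46026
TRAIN_FRACTION = 0.8

# Exact training-set size depends on how the split is rounded.
train_size = round(TOTAL_SAMPLES * TRAIN_FRACTION)
# ASSUMES no gradient accumulation and a single device.
steps_per_epoch = math.ceil(train_size / hyperparams["batch_size"])
max_steps = steps_per_epoch * hyperparams["num_epochs"]


class EarlyStopping:
    """Stop when the monitored loss has not improved for `patience` checks."""

    def __init__(self, patience=2, min_delta=0.0):  # hypothetical defaults
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_checks = 0

    def should_stop(self, loss):
        if loss < self.best - self.min_delta:
            self.best = loss
            self.bad_checks = 0
        else:
            self.bad_checks += 1
        return self.bad_checks >= self.patience


# Made-up validation losses, one per epoch.
stopper = EarlyStopping(patience=2)
stopped_at = None
for epoch, loss in enumerate([0.40, 0.31, 0.29, 0.30, 0.32, 0.28], start=1):
    if stopper.should_stop(loss):
        stopped_at = epoch
        break
print(steps_per_epoch, max_steps, stopped_at)
```

With a batch size of 2, ten full epochs would come to roughly 184,000 optimizer steps, which is why early stopping can matter for the total training budget.
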

Experimental Results

Main Results

Model	Accuracy (%)	Precision (%)	Recall (%)	F1 (%)
LLaMA 3 (8B)	91.62	91.4	91.5	91.45
Mistral 7B	89.88	89.5	89.8	89.66
LLaMA 2 (7B)	88.74	88.2	88.6	88.4
RoBERTa	85.65	85.2	85.7	85.44
ModernBERT	83.92	83.1	84.0	83.55
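
As a quick sanity check on the table, the sketch below recomputes F1 as the harmonic mean of the reported precision and recall and compares it with the reported F1. The figures are copied from the table; small gaps are expected because the inputs are rounded and the averaging convention is not stated.

```python
# (accuracy, precision, recall, F1) as reported in the table, in percent.
reported = {
    "LLaMA 3 (8B)": (91.62, 91.4, 91.5, 91.45),
    "Mistral 7B": (89.88, 89.5, 89.8, 89.66),
    "LLaMA 2 (7B)": (88.74, 88.2, 88.6, 88.4),
    "RoBERTa": (85.65, 85.2, 85.7, 85.44),
    "ModernBERT": (83.92, 83.1, 84.0, 83.55),
}


def harmonic_f1(precision, recall):
    """F1 as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


gaps = {}
for model, (_, precision, recall, f1) in reported.items():
    recomputed = harmonic_f1(precision, recall)
    gaps[model] = abs(recomputed - f1)
    print(f"{model}: reported F1 {f1:.2f}, recomputed {recomputed:.2f}")

max_gap = max(gaps.values())
best_model = max(reported, key=lambda name: reported[name][3])
gain_over_roberta = reported[best_model][3] - reported["RoBERTa"][3]
print(f"largest gap: {max_gap:.3f} points; best model: {best_model}")
```

The recomputed values agree with the reported F1 scores to within 0.02 points, and the best LLM leads the stronger encoder baseline (RoBERTa) by about 6 F1 points.
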