BUSTED at AraGenEval Shared Task: A Comparative Study of Transformer-Based Models for Arabic AI-Generated Text Detection

Zain, Farooqui, Rafi

This paper details our submission to the AraGenEval Shared Task on Arabic AI-generated text detection, where our team, BUSTED, secured 5th place. We investigated the effectiveness of three pre-trained transformer models: AraELECTRA, CAMeLBERT, and XLM-RoBERTa. Our approach involved fine-tuning each model on the provided dataset for a binary classification task. Our findings revealed a surprising result: the multilingual XLM-RoBERTa model achieved the highest performance with an F1 score of 0.7701, outperforming the specialized Arabic models. This work underscores the complexities of AI-generated text detection and highlights the strong generalization capabilities of multilingual models.

Basic Information

Paper ID: 2510.20610
Title: BUSTED at AraGenEval Shared Task: A Comparative Study of Transformer-Based Models for Arabic AI-Generated Text Detection
Authors: Ali Zain, Sareem Farooqui, Muhammad Rafi (National University of Computer and Emerging Sciences, FAST, Karachi, Pakistan)
Classification: cs.CL (Computation and Language), cs.AI (Artificial Intelligence)
Publication Date: October 25, 2025 (arXiv version)
Paper Link: https://arxiv.org/abs/2510.20610v2

Abstract

This paper gives a detailed account of the BUSTED team's submission to the AraGenEval shared task on Arabic AI-generated text detection, which placed 5th. The researchers compared the effectiveness of three pre-trained Transformer models: AraELECTRA, CAMeLBERT, and XLM-RoBERTa. Each model was fine-tuned on the provided dataset for binary classification. The study produced a surprising finding: the multilingual XLM-RoBERTa model achieved the highest performance, with an F1 score of 0.7701, surpassing the Arabic-specific models. This work emphasizes the complexity of AI-generated text detection and highlights the strong generalization capabilities of multilingual models.

Research Background and Motivation

Problem Definition

With the increasing sophistication of Large Language Models (LLMs), the boundary between human-authored and machine-generated text has become blurred. This reality presents significant societal risks, ranging from accelerating misinformation dissemination to undermining academic integrity. Consequently, developing reliable AI-generated text detectors has become an urgent research priority.

Research Significance

Social Impact: Misuse of AI-generated text may lead to misinformation propagation and academic misconduct
Technical Challenges: Modern LLMs generate highly fluent text, limiting the effectiveness of traditional detection methods
Language Specificity: Arabic, as a relatively low-resource language, still lacks mature tools in the AI text detection domain

Limitations of Existing Approaches

Inadequacy of Traditional Methods: Early stylometric approaches built on statistical features (such as n-gram frequency, readability scores, and syntactic structure) perform poorly at detecting fluent text from modern LLMs
Scarcity of Language Resources: Arabic AI text detection tools lag behind those for other languages
Unclear Model Selection: Lack of systematic comparison of different Transformer architectures on Arabic AI text detection tasks

Core Contributions

Comparative Model Study: Provides direct comparison between monolingual and multilingual models on Arabic text detection tasks
Counterintuitive Findings: Demonstrates that a multilingual model can outperform Arabic-specific models on this task
Preprocessing Impact Analysis: Analyzes how preprocessing choices such as text normalization can unexpectedly harm model performance
Practical Validation: Achieves 5th place in the AraGenEval shared task, validating the effectiveness of the approach

Methodology Details

Task Definition

Input: An Arabic text string
Output: Binary label ('human' or 'machine')
Task Type: Binary text classification problem

Model Architecture

The researchers implemented systems based on three different pre-trained models:

System 1: AraELECTRA

Model: aubmindlab/araelectra-base-discriminator
Characteristics: Specialized Arabic ELECTRA model
Preprocessing: Applies aggressive Arabic text normalization
- Normalizes various Arabic characters (e.g., alef variants to standard alef)
- Converts ta marbuta to ha
- Removes all Arabic diacritics and non-alphanumeric characters
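
The normalization steps listed above can be sketched as follows. This is a minimal illustration, not the authors' code: the exact code points (which alef variants and which diacritic range) and the choice to replace removed characters with a space are assumptions, since the summary names the operations but not their implementation.

```python
import re

# Assumed code points; the source names the operations, not the characters.
ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")  # alef with madda / hamza above / hamza below
BARE_ALEF = "\u0627"
TA_MARBUTA = "\u0629"
HA = "\u0647"
DIACRITICS = re.compile("[\u064B-\u0652]")  # fathatan through sukun (includes shadda)


def normalize_arabic(text: str) -> str:
    """Apply the aggressive normalization described for the AraELECTRA system."""
    text = ALEF_VARIANTS.sub(BARE_ALEF, text)   # alef variants -> bare alef
    text = text.replace(TA_MARBUTA, HA)         # ta marbuta -> ha
    text = DIACRITICS.sub("", text)             # strip diacritics
    # Replace every remaining non-alphanumeric, non-space character with a space.
    text = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text)
    return " ".join(text.split())               # collapse runs of whitespace
```

Because punctuation, diacritics, and orthographic variants are all discarded, any stylistic signal they carry is no longer available to the classifier, which is the concern raised about this system in the results analysis.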

System 2: CAMeLBERT

Model: CAMeL-Lab/bert-base-arabic-camelbert-mix
Characteristics: Widely-used Arabic BERT model
Preprocessing: No specific text normalization applied; relies entirely on the model's pre-trained tokenizer

System 3: XLM-RoBERTa

Model: xlm-roberta-base
Characteristics: Large-scale multilingual model
Preprocessing: Similar to CAMeLBERT setup; no language-specific normalization performed

Technical Innovations

Systematic Comparison: Directly compares monolingual and multilingual models on the Arabic AI text detection task
Differentiated Preprocessing Strategies: Explores the impact of different preprocessing strategies on model performance
Data-Driven Analysis: Guides model selection and optimization based on dataset characteristics

Experimental Setup

Dataset

Dataset: AraGenEval dataset
Scale: 4,734 training samples after cleaning
Class Distribution: Nearly balanced
- Machine-generated: 2,399 samples (50.68%)
- Human-authored: 2,335 samples (49.32%)

Data Characteristics Analysis

Significant Text Length Differences:
- Average length of human-authored text: 4,059.13 characters
- Average length of machine-generated text: 1,934.53 characters
Vocabulary and N-gram Differences:
- Human text: Frequently contains current-event-related vocabulary such as "Gaza," "the war," "Israel"
- Machine text: Uses more generic formal vocabulary, such as "can be," "in a way"

Data Splitting Strategy

AraELECTRA & CAMeLBERT: Use all 4,734 training samples for training and development phase evaluation
XLM-RoBERTa: Split training data in 80/20 ratio
- Training set: 3,787 samples
- Validation set: 947 samples
- Employs stratified sampling to maintain label distribution
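
A stratified 80/20 split of this kind can be sketched as below. The implementation is an assumption (the summary does not say which tool the authors used, and the seed and rounding rule are illustrative), but with the class counts reported above it reproduces the stated set sizes.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_fraction=0.2, seed=42):
    """Return (train_indices, val_indices), sampling each class separately."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, val_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * val_fraction)  # per-class validation size
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


# Class counts taken from the dataset description.
labels = ["machine"] * 2399 + ["human"] * 2335
train_idx, val_idx = stratified_split(labels)
print(len(train_idx), len(val_idx))  # 3787 947
```

Sampling within each class keeps the validation set at roughly the same 50.7/49.3 label ratio as the full training data.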

Evaluation Metrics

Primary Metric: Macro-averaged F1 score
Auxiliary Metrics: Accuracy, Precision, Recall, Specificity, Balanced Accuracy
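
For reference, the listed metrics can all be derived from the four cells of a binary confusion matrix. The sketch below assumes "machine" is the positive class, and the counts in the example call are made up for illustration; they are not results from the paper.

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard binary-classification metrics (assumes no denominator is zero)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)              # true positive rate
    specificity = tn / (tn + fp)         # true negative rate
    f1_positive = 2 * tp / (2 * tp + fp + fn)
    f1_negative = 2 * tn / (2 * tn + fn + fp)  # F1 with the classes swapped
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "balanced_accuracy": (recall + specificity) / 2,
        "f1_positive": f1_positive,
        "f1_macro": (f1_positive + f1_negative) / 2,
    }


# Hypothetical counts, for illustration only.
for name, value in binary_metrics(tp=80, fp=30, tn=70, fn=20).items():
    print(f"{name}: {value:.4f}")
```

Macro-averaged F1 is the unweighted mean of the per-class F1 scores, so it generally differs from the F1 of the positive class alone.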

Implementation Details

Hyperparameter	Value
Learning Rate	2e-5
Batch Size	4
Optimizer	AdamW
Weight Decay	0.01
Maximum Sequence Length	512
Training Epochs (AraELECTRA)	4
Training Epochs (CAMeLBERT)	4
Training Epochs (XLM-RoBERTa)	5
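
The table can be captured as a small configuration object. This is a sketch under assumptions: the summary does not name the training framework, so the keyword names below simply follow those of `transformers.TrainingArguments` and of a Hugging Face tokenizer call, and `FineTuneConfig`, `CONFIGS`, and the use of truncation are hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    model_name: str
    num_train_epochs: int
    learning_rate: float = 2e-5
    batch_size: int = 4
    weight_decay: float = 0.01
    max_seq_length: int = 512

    def training_kwargs(self, output_dir: str) -> dict:
        """Keyword arguments named as in transformers.TrainingArguments (assumed framework)."""
        return {
            "output_dir": output_dir,
            "learning_rate": self.learning_rate,
            "per_device_train_batch_size": self.batch_size,
            "weight_decay": self.weight_decay,
            "num_train_epochs": self.num_train_epochs,
        }

    def tokenizer_kwargs(self) -> dict:
        """Truncating to the maximum length is an assumption, not stated in the source."""
        return {"truncation": True, "max_length": self.max_seq_length}


# Model identifiers and epoch counts as listed in the summary.
CONFIGS = {
    "AraELECTRA": FineTuneConfig("aubmindlab/araelectra-base-discriminator", 4),
    "CAMeLBERT": FineTuneConfig("CAMeL-Lab/bert-base-arabic-camelbert-mix", 4),
    "XLM-RoBERTa": FineTuneConfig("xlm-roberta-base", 5),
}
```

Only the model identifier and the epoch count vary between systems; every other hyperparameter is shared, which is why they appear as defaults.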

Experimental Results

Main Results

Model	F1-Score	Accuracy	Precision	Recall	Specificity	Balanced Accuracy
XLM-RoBERTa	0.7701	0.760	0.7390	0.804	0.716	0.760
CAMeLBERT	0.7290	0.710	0.6842	0.780	0.640	0.710
AraELECTRA	0.6180	0.550	0.5369	0.728	0.372	0.550
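
As a sanity check, the columns of the table can be tested against one another using the standard definitions. The sketch below uses only the values reported above; the dictionary layout and function names are hypothetical.

```python
# Values copied from the results table.
REPORTED = {
    "XLM-RoBERTa": {"f1": 0.7701, "precision": 0.7390, "recall": 0.804,
                    "specificity": 0.716, "balanced_accuracy": 0.760},
    "CAMeLBERT": {"f1": 0.7290, "precision": 0.6842, "recall": 0.780,
                  "specificity": 0.640, "balanced_accuracy": 0.710},
    "AraELECTRA": {"f1": 0.6180, "precision": 0.5369, "recall": 0.728,
                   "specificity": 0.372, "balanced_accuracy": 0.550},
}


def f1_from(precision: float, recall: float) -> float:
    """F1 as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def balanced_accuracy_from(recall: float, specificity: float) -> float:
    """Balanced accuracy as the mean of recall and specificity."""
    return (recall + specificity) / 2


for name, row in REPORTED.items():
    f1 = f1_from(row["precision"], row["recall"])
    bal = balanced_accuracy_from(row["recall"], row["specificity"])
    print(f"{name}: F1 {f1:.4f} (reported {row['f1']:.4f}), "
          f"balanced accuracy {bal:.3f} (reported {row['balanced_accuracy']:.3f})")
```

For all three rows the recomputed values agree with the table to the reported precision. The F1 column therefore matches the harmonic mean of the listed precision and recall, that is, the F1 of the positive class, which is worth keeping in mind when comparing against scores described as macro-averaged.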

Key Findings

Multilingual Model Advantage: XLM-RoBERTa achieves the best score on every reported metric, leading CAMeLBERT by about 4 F1 points (0.7701 vs. 0.7290) and AraELECTRA by about 15 (0.7701 vs. 0.6180)
Preprocessing Strategy Impact: AraELECTRA's aggressive text normalization strategy may be counterproductive
Performance Ranking: XLM-RoBERTa > CAMeLBERT > AraELECTRA

Results Analysis

Reasons for XLM-RoBERTa's Success

Diverse Pre-training Corpus: Extensive pre-training on 100 languages may yield more general-purpose features that transfer better to this task
Style Sensitivity: Better captures stylistic differences between human text (news-focused) and machine text (formal and analytical)

Reasons for AraELECTRA's Poor Performance

Over-normalization: Aggressive text normalization and diacritic removal may eliminate critical fine-grained signals
Information Loss: Removes important distinguishing features such as lexical style choices and specific named entities

Error Pattern Analysis

Precision vs. Recall: All models show lower precision than recall, indicating a tendency to misclassify human text as machine-generated
Possible Causes: Domain mismatch, or formulaic human-authored text that resembles AI generation patterns
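
The link between the precision/recall gap and the direction of the errors can be made explicit. Taking "machine" as the positive class (an assumption consistent with the interpretation above), and using the recall and specificity values from the results table:

```latex
\[
\text{Precision} < \text{Recall}
\iff \frac{TP}{TP + FP} < \frac{TP}{TP + FN}
\iff FP > FN \qquad (TP > 0)
\]
\[
\text{FPR} = 1 - \text{Specificity}, \qquad \text{FNR} = 1 - \text{Recall}
\]
\[
\begin{array}{lcc}
\text{Model} & \text{FPR} & \text{FNR} \\
\text{XLM-RoBERTa} & 1 - 0.716 = 0.284 & 1 - 0.804 = 0.196 \\
\text{CAMeLBERT} & 1 - 0.640 = 0.360 & 1 - 0.780 = 0.220 \\
\text{AraELECTRA} & 1 - 0.372 = 0.628 & 1 - 0.728 = 0.272
\end{array}
\]
```

In every row the false positive rate exceeds the false negative rate, so human text is flagged as machine-generated more often than machine text is missed; for AraELECTRA, well over half of the human samples are misclassified.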

Historical Development

Early Methods: Statistical stylometry-based authorship attribution and machine text detection
- Features: n-gram frequency, readability scores, syntactic structure
- Limitations: Limited effectiveness on modern LLMs
Neural Network Methods: Current mainstream research
- Fine-tuning pre-trained Transformers (such as BERT)
- Detecting statistical artifacts in LLM generation processes
- Embedding "watermarks" in text generation processes

Paper Positioning

Follows the fine-tuning paradigm
Inspired by comprehensive comparative studies (such as Al-Shboul et al., 2024)
Focuses on AI text detection in resource-scarce Arabic language domain

Conclusions and Discussion

Main Conclusions

Unexpected Advantage of Multilingual Models: XLM-RoBERTa surpasses specialized Arabic models on Arabic AI text detection tasks
Double-Edged Effect of Preprocessing: Excessive text normalization may harm model performance
Importance of Data Characteristics: Text length and vocabulary choice differ markedly between human and machine text in this dataset and may serve as distinguishing features

Limitations

Poor AraELECTRA Performance: Likely attributable to the aggressive preprocessing, though this effect was not isolated from the choice of model
Insufficient Error Analysis: Lacks detailed qualitative error analysis
Limited Single-Dataset Validation: Validation only on the AraGenEval dataset

Future Directions

Preprocessing Optimization: Explore less aggressive text normalization methods
Model Ensemble: Experiment with model ensemble techniques
In-Depth Error Analysis: Better understand failure patterns in the task
Cross-Domain Generalization: Validate the approach on multiple Arabic datasets

In-Depth Evaluation

Strengths

Systematic Comparison: Provides comprehensive comparison of different types of Transformer models
Counterintuitive Findings: The result that a multilingual model outperforms Arabic-specific models is noteworthy
Practical Value: Places 5th in the shared task, showing that the method is competitive in practice
Sufficient Data Analysis: Provides in-depth analysis of dataset characteristics, informing model selection
Reasonable Experimental Design: Appropriate hyperparameter settings and evaluation metric selection

Weaknesses

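Inconsistent Preprocessing Strategies: AraELECTRA receives aggressive normalization while CAMeLBERT and XLM-RoBERTa receive none, so the effects of model and preprocessing cannot be separated
Inconsistent Data Splitting: XLM-RoBERTa trains on an 80% split (3,787 samples) while the other two models use all 4,734 samples, and XLM-RoBERTa also trains for one more epoch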
Missing Error Analysis: Lacks in-depth analysis of model failure cases
Insufficient Ablation Studies: Does not fully verify component contributions
Limited Generalization Validation: Validation only on a single dataset

Impact

Academic Contribution: Provides important benchmarks for Arabic AI text detection
Practical Guidance: Offers reference for model selection in similar tasks
Methodological Value: Systematic comparison methodology applicable to other languages and tasks
Reproducibility: Provides detailed experimental settings for easy reproduction

Applicable Scenarios

Arabic Content Moderation: AI text detection for social media and news platforms
Academic Integrity Checking: Assignment and paper originality verification in educational institutions
Multilingual Environments: Scenarios requiring AI text detection across multiple languages
Resource-Constrained Environments: Provides methodological reference for AI text detection in other low-resource languages

References

This paper cites multiple important related works, including:

Transformer architecture foundational paper (Vaswani et al., 2017)
BERT model (Devlin et al., 2019)
ELECTRA model (Clark et al., 2020)
XLM-RoBERTa model (Conneau et al., 2020)
Specialized Arabic models: AraELECTRA (Antoun et al., 2021) and CAMeLBERT (Inoue et al., 2021)
Arabic text classification survey (Al-Shboul et al., 2024)

Overall Assessment: This is a solid empirical paper whose systematic comparison reveals an unexpected advantage for a multilingual model on Arabic AI-generated text detection. Despite some methodological limitations, its findings are valuable for the field and point to useful directions for future research.