This paper details our submission to the AraGenEval Shared Task on Arabic AI-generated text detection, where our team, BUSTED, secured 5th place. We investigated the effectiveness of three pre-trained transformer models: AraELECTRA, CAMeLBERT, and XLM-RoBERTa. Our approach involved fine-tuning each model on the provided dataset for a binary classification task. Our findings revealed a surprising result: the multilingual XLM-RoBERTa model achieved the highest performance with an F1 score of 0.7701, outperforming the specialized Arabic models. This work underscores the complexities of AI-generated text detection and highlights the strong generalization capabilities of multilingual models.
- Paper ID: 2510.20610
- Title: BUSTED at AraGenEval Shared Task: A Comparative Study of Transformer-Based Models for Arabic AI-Generated Text Detection
- Authors: Ali Zain, Sareem Farooqui, Muhammad Rafi (National University of Computer and Emerging Sciences, FAST, Karachi, Pakistan)
- Classification: cs.CL (Computational Linguistics), cs.AI (Artificial Intelligence)
- Publication Date: October 25, 2025 (arXiv version)
- Paper Link: https://arxiv.org/abs/2510.20610v2
This paper gives a detailed account of the BUSTED team's submission to the AraGenEval shared task on Arabic AI-generated text detection, which placed 5th. The researchers compared three pre-trained Transformer models (AraELECTRA, CAMeLBERT, and XLM-RoBERTa), fine-tuning each on the provided dataset for binary classification. The study's surprising finding is that the multilingual XLM-RoBERTa model achieved the highest performance, with an F1 score of 0.7701, surpassing the specialized Arabic models. The work underscores the complexity of AI-generated text detection and highlights the strong generalization capabilities of multilingual models.
With the increasing sophistication of Large Language Models (LLMs), the boundary between human-authored and machine-generated text has become blurred. This reality presents significant societal risks, ranging from accelerating misinformation dissemination to undermining academic integrity. Consequently, developing reliable AI-generated text detectors has become an urgent research priority.
- Social Impact: Misuse of AI-generated text may lead to misinformation propagation and academic misconduct
- Technical Challenges: Modern LLMs generate highly fluent text, limiting the effectiveness of traditional detection methods
- Language Specificity: Arabic, as a relatively low-resource language, still lacks mature tools in the AI text detection domain
- Inadequacy of Traditional Methods: Early stylometric approaches based on statistical features (such as n-gram frequency, readability scores, and syntactic structure) perform poorly at detecting fluent text from modern LLMs
- Scarcity of Language Resources: Arabic AI text detection tools lag behind those for other languages
- Unclear Model Selection: Lack of systematic comparison of different Transformer architectures on Arabic AI text detection tasks
- Comparative Model Study: Provides direct comparison between monolingual and multilingual models on Arabic text detection tasks
- Counterintuitive Findings: Demonstrates that multilingual models can achieve superior performance compared to specialized language models
- Preprocessing Impact Analysis: Analyzes how preprocessing choices such as text normalization can unexpectedly harm model performance
- Practical Validation: Achieves 5th place in the AraGenEval shared task, validating the effectiveness of the approach
- Input: An Arabic text string
- Output: Binary label ('human' or 'machine')
- Task Type: Binary text classification problem
The researchers implemented systems based on three different pre-trained models:
- Model: aubmindlab/araelectra-base-discriminator
- Characteristics: Specialized Arabic ELECTRA model
- Preprocessing: Applies aggressive Arabic text normalization
- Normalizes various Arabic characters (e.g., alef variants to standard alef)
- Converts ta marbuta to ha
- Removes all Arabic diacritics and non-alphanumeric characters
- Model: CAMeL-Lab/bert-base-arabic-camelbert-mix
- Characteristics: Widely-used Arabic BERT model
- Preprocessing: No specific text normalization applied; relies entirely on the model's pre-trained tokenizer
- Model: xlm-roberta-base
- Characteristics: Large-scale multilingual model
- Preprocessing: Similar to CAMeLBERT setup; no language-specific normalization performed
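The aggressive normalization described for the AraELECTRA pipeline can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function name `normalize_arabic`, the exact set of alef variants, and the diacritic range are assumptions, since the summary does not list the characters involved.

```python
import re

# Assumed alef variants: madda (U+0622), hamza above (U+0623),
# hamza below (U+0625), wasla (U+0671). All map to bare alef (U+0627).
ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625\u0671]")
# Assumed diacritic set: tashkeel U+064B..U+0652 plus dagger alef U+0670.
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")
# Anything that is neither a Unicode letter/digit nor whitespace
# (the underscore is matched explicitly because \w includes it).
NON_ALNUM = re.compile(r"[^\w\s]|_")


def normalize_arabic(text: str) -> str:
    """Apply the three normalization steps, then collapse whitespace."""
    text = DIACRITICS.sub("", text)            # strip diacritics
    text = ALEF_VARIANTS.sub("\u0627", text)   # alef variants -> alef
    text = text.replace("\u0629", "\u0647")    # ta marbuta -> ha
    text = NON_ALNUM.sub(" ", text)            # drop punctuation and symbols
    return " ".join(text.split())
```

Note how much surface detail this discards: punctuation, vowel marks, and orthographic variants all disappear before the tokenizer sees the text, which is the information loss the analysis later points to as a possible cause of AraELECTRA's weaker results.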
- Systematic Comparison: First systematic comparison of monolingual vs. multilingual models on Arabic AI text detection tasks
- Differentiated Preprocessing Strategies: Explores the impact of different preprocessing strategies on model performance
- Data-Driven Analysis: Guides model selection and optimization based on dataset characteristics
- Dataset: AraGenEval dataset
- Scale: 4,734 training samples after cleaning
- Class Distribution: Nearly balanced
- Machine-generated: 2,399 samples (50.68%)
- Human-authored: 2,335 samples (49.32%)
- Significant Text Length Differences:
- Average length of human-authored text: 4,059.13 characters
- Average length of machine-generated text: 1,934.53 characters
- Vocabulary and N-gram Differences:
- Human text: Frequently contains current-event-related vocabulary such as "Gaza," "the war," "Israel"
- Machine text: Uses more generic formal vocabulary, such as "can be," "in a way"
- AraELECTRA & CAMeLBERT: Use all 4,734 training samples for training and development phase evaluation
- XLM-RoBERTa: Split training data in 80/20 ratio
- Training set: 3,787 samples
- Validation set: 947 samples
- Employs stratified sampling to maintain label distribution
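A stratified 80/20 split of this kind can be sketched with the standard library alone. The summary does not say which tool or random seed the authors used, so the function name `stratified_split`, the seed, and the per-class rounding rule below are assumptions. With the reported class counts, this particular rounding happens to reproduce the reported split sizes.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_fraction=0.2, seed=42):
    """Return (train_indices, val_indices), sampling each class separately."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)  # assumed seed, for reproducibility only
    train_idx, val_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_val = round(len(indices) * val_fraction)  # assumed rounding rule
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


# Class counts taken from the dataset description above.
labels = ["machine"] * 2399 + ["human"] * 2335
train_idx, val_idx = stratified_split(labels)
print(len(train_idx), len(val_idx))  # 3787 947
```

Splitting within each class keeps the validation set at roughly the same 50.68% / 49.32% balance as the full training data.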
- Primary Metric: Macro-averaged F1 score
- Auxiliary Metrics: Accuracy, Precision, Recall, Specificity, Balanced Accuracy
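For reference, all of these metrics follow from the four cells of a binary confusion matrix. The sketch below assumes "machine" is treated as the positive class, which the summary does not state explicitly; the function name and the example counts are hypothetical, and zero denominators are not handled.

```python
def binary_metrics(tp, fp, fn, tn):
    """Metrics from confusion-matrix counts, with 'machine' as positive.

    tp: machine text predicted machine   fp: human text predicted machine
    fn: machine text predicted human     tn: human text predicted human
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    balanced_accuracy = (recall + specificity) / 2
    # Per-class F1, then the unweighted (macro) average over both classes.
    f1_machine = 2 * tp / (2 * tp + fp + fn)
    f1_human = 2 * tn / (2 * tn + fn + fp)
    macro_f1 = (f1_machine + f1_human) / 2
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "balanced_accuracy": balanced_accuracy,
        "f1_machine": f1_machine,
        "f1_human": f1_human,
        "macro_f1": macro_f1,
    }


# Hypothetical counts, chosen only to exercise the function.
example = binary_metrics(tp=40, fp=20, fn=10, tn=30)
for name, value in example.items():
    print(f"{name}: {value:.4f}")
```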
| Hyperparameter | Value |
|---|---|
| Learning Rate | 2e-5 |
| Batch Size | 4 |
| Optimizer | AdamW |
| Weight Decay | 0.01 |
| Maximum Sequence Length | 512 |
| Training Epochs (AraELECTRA) | 4 |
| Training Epochs (CAMeLBERT) | 4 |
| Training Epochs (XLM-RoBERTa) | 5 |
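To make the training budget implied by these settings concrete, the values can be collected in a small config object. The class name `FineTuneConfig` and its methods are hypothetical. The `train_size` values are taken from the data-splitting description. The step counts assume no gradient accumulation and that the final partial batch is kept; neither detail is given in the paper summary.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    model_name: str
    num_epochs: int
    train_size: int
    learning_rate: float = 2e-5
    batch_size: int = 4
    weight_decay: float = 0.01
    max_seq_length: int = 512
    optimizer: str = "AdamW"

    def steps_per_epoch(self) -> int:
        # Assumes the last, smaller batch is not dropped.
        return math.ceil(self.train_size / self.batch_size)

    def total_steps(self) -> int:
        # Assumes one optimizer step per batch (no accumulation).
        return self.steps_per_epoch() * self.num_epochs


CONFIGS = [
    FineTuneConfig("aubmindlab/araelectra-base-discriminator", 4, 4734),
    FineTuneConfig("CAMeL-Lab/bert-base-arabic-camelbert-mix", 4, 4734),
    FineTuneConfig("xlm-roberta-base", 5, 3787),
]

for cfg in CONFIGS:
    print(cfg.model_name, cfg.steps_per_epoch(), cfg.total_steps())
```

Under these assumptions the three runs take a similar number of optimizer steps, even though XLM-RoBERTa trains for one more epoch on a smaller training set.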
| Model | F1-Score | Accuracy | Precision | Recall | Specificity | Balanced Accuracy |
|---|---|---|---|---|---|---|
| XLM-RoBERTa | 0.7701 | 0.760 | 0.7390 | 0.804 | 0.716 | 0.760 |
| CAMeLBERT | 0.7290 | 0.710 | 0.6842 | 0.780 | 0.640 | 0.710 |
| AraELECTRA | 0.6180 | 0.550 | 0.5369 | 0.728 | 0.372 | 0.550 |
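The columns of the table are related by simple identities, so its internal consistency can be checked with a few lines of arithmetic. The sketch below recomputes F1 as the harmonic mean of the reported precision and recall, and balanced accuracy as the mean of recall and specificity. It only checks the reported numbers against each other; it does not settle how the official shared-task score was averaged.

```python
# Values copied from the results table above.
REPORTED = {
    "XLM-RoBERTa": {"f1": 0.7701, "precision": 0.7390, "recall": 0.804,
                    "specificity": 0.716, "balanced_accuracy": 0.760},
    "CAMeLBERT": {"f1": 0.7290, "precision": 0.6842, "recall": 0.780,
                  "specificity": 0.640, "balanced_accuracy": 0.710},
    "AraELECTRA": {"f1": 0.6180, "precision": 0.5369, "recall": 0.728,
                   "specificity": 0.372, "balanced_accuracy": 0.550},
}


def harmonic_f1(precision, recall):
    """F1 as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


recomputed = {}
for model, m in REPORTED.items():
    f1 = harmonic_f1(m["precision"], m["recall"])
    bal = (m["recall"] + m["specificity"]) / 2
    recomputed[model] = (f1, bal)
    print(f"{model}: F1 {f1:.4f} (reported {m['f1']:.4f}), "
          f"balanced accuracy {bal:.3f} "
          f"(reported {m['balanced_accuracy']:.3f})")
```

For all three models the recomputed values agree with the reported ones to the precision shown in the table.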
- Multilingual Model Advantage: XLM-RoBERTa achieves the best performance across all reported metrics, leading CAMeLBERT by about 0.04 F1 and AraELECTRA by about 0.15 F1
- Preprocessing Strategy Impact: AraELECTRA's aggressive text normalization strategy may be counterproductive
- Performance Ranking: XLM-RoBERTa > CAMeLBERT > AraELECTRA
- Diverse Pre-training Corpus: Extensive pre-training on 100 languages may provide stronger feature extraction capabilities for generalization
- Style Sensitivity: May better capture stylistic differences between human text (news-focused) and machine text (formal and analytical)
- Over-normalization: Aggressive text normalization and diacritic removal may eliminate critical fine-grained signals
- Information Loss: Removes important distinguishing features such as lexical style choices and specific named entities
- Precision vs. Recall: All models show lower precision than recall, indicating a tendency to misclassify human text as machine-generated
- Possible Causes: Domain mismatch or formulaic human-authored text may resemble AI generation patterns
- Early Methods: Statistical stylometry-based authorship attribution and machine text detection
- Features: n-gram frequency, readability scores, syntactic structure
- Limitations: Limited effectiveness on modern LLMs
- Neural Network Methods: Current mainstream research
- Fine-tuning pre-trained Transformers (such as BERT)
- Detecting statistical artifacts in LLM generation processes
- Embedding "watermarks" in text generation processes
- Follows the fine-tuning paradigm
- Inspired by comprehensive comparative studies (such as Al-Shboul et al., 2024)
- Focuses on AI text detection in resource-scarce Arabic language domain
- Unexpected Advantage of Multilingual Models: XLM-RoBERTa surpasses specialized Arabic models on Arabic AI text detection tasks
- Double-Edged Effect of Preprocessing: Excessive text normalization may harm model performance
- Importance of Data Characteristics: Text length and vocabulary choice are key features for distinguishing human from machine text
- Poor AraELECTRA Performance: Likely attributable to the aggressive preprocessing strategy, although no ablation isolates this effect
- Insufficient Error Analysis: Lacks detailed qualitative error analysis
- Limited Single-Dataset Validation: Validation only on the AraGenEval dataset
- Preprocessing Optimization: Explore less aggressive text normalization methods
- Model Ensemble: Experiment with model ensemble techniques
- In-Depth Error Analysis: Better understand failure patterns in the task
- Cross-Domain Generalization: Validate the approach on multiple Arabic datasets
- Systematic Comparison: Provides comprehensive comparison of different types of Transformer models
- Counterintuitive Findings: The finding that multilingual models outperform specialized language models is significant
- Practical Value: Achieves good performance in actual competition, validating method effectiveness
- Thorough Data Analysis: Provides in-depth analysis of dataset characteristics, informing model selection
- Reasonable Experimental Design: Appropriate hyperparameter settings and evaluation metric selection
- Inconsistent Preprocessing Strategies: Three models use different preprocessing strategies, affecting comparison fairness
- Inconsistent Data Splitting: Different models use different data splitting strategies
- Missing Error Analysis: Lacks in-depth analysis of model failure cases
- Insufficient Ablation Studies: Does not fully verify component contributions
- Limited Generalization Validation: Validation only on a single dataset
- Academic Contribution: Provides important benchmarks for Arabic AI text detection
- Practical Guidance: Offers reference for model selection in similar tasks
- Methodological Value: Systematic comparison methodology applicable to other languages and tasks
- Reproducibility: Provides detailed experimental settings for easy reproduction
- Arabic Content Moderation: AI text detection for social media and news platforms
- Academic Integrity Checking: Assignment and paper originality verification in educational institutions
- Multilingual Environments: Scenarios requiring AI text detection across multiple languages
- Resource-Constrained Environments: Provides methodological reference for AI text detection in other low-resource languages
This paper cites multiple important related works, including:
- Transformer architecture foundational paper (Vaswani et al., 2017)
- BERT model (Devlin et al., 2019)
- ELECTRA model (Clark et al., 2020)
- XLM-RoBERTa model (Conneau et al., 2020)
- Specialized Arabic models: AraELECTRA (Antoun et al., 2021) and CAMeLBERT (Inoue et al., 2021)
- Arabic text classification survey (Al-Shboul et al., 2024)
Overall Assessment: This is a solid empirical paper whose systematic comparison reveals an unexpected advantage of multilingual models on Arabic AI text detection. Despite some methodological limitations, its findings are valuable for the field and point to useful directions for future research.