Large Language Models (LLMs) have demonstrated remarkable capabilities across numerous languages; however, their effectiveness in low-resource languages like Persian requires thorough investigation. This paper presents a comprehensive benchmark of several open-source LLMs for Persian Natural Language Processing (NLP) tasks, utilizing both zero-shot and few-shot learning paradigms. We evaluate models across a range of tasks including sentiment analysis, named entity recognition, reading comprehension, and question answering, using established Persian datasets such as ParsiNLU and ArmanEmo. Our methodology encompasses rigorous experimental setups for both zero-shot and few-shot scenarios, employing metrics such as Accuracy, F1-score, BLEU, and ROUGE for performance evaluation. The results reveal that Gemma 2 consistently outperforms other models across nearly all tasks in both learning paradigms, with particularly strong performance in complex reasoning tasks. However, most models struggle with token-level understanding tasks like Named Entity Recognition, highlighting specific challenges in Persian language processing. This study contributes to the growing body of research on multilingual LLMs, providing valuable insights into their performance in Persian and offering a benchmark for future model development.
Benchmarking Open-Source Large Language Models for Persian in Zero-Shot and Few-Shot Learning
- Paper ID: 2510.12807
- Title: Benchmarking Open-Source Large Language Models for Persian in Zero-Shot and Few-Shot Learning
- Authors: Mahdi Cherakhloo, Arash Abbasi, Mohammad Saeid Sarafraz, Bijan Vosoughi Vahdat
- Classification: cs.CL cs.AI
- Publication Date: October 16, 2025
- Paper Link: https://arxiv.org/abs/2510.12807
This study presents a comprehensive benchmark evaluation of multiple open-source large language models on Persian natural language processing tasks using zero-shot and few-shot learning paradigms. The research covers sentiment analysis, named entity recognition, reading comprehension, and question-answering tasks, utilizing established Persian datasets such as ParsiNLU and ArmanEmo. The experiments employ rigorous zero-shot and few-shot settings with performance evaluation using metrics including accuracy, F1 score, BLEU, and ROUGE. Results demonstrate that Gemma 2 achieves superior performance across nearly all tasks in both learning paradigms, particularly excelling in complex reasoning tasks. However, most models perform poorly on token-level understanding tasks such as named entity recognition, highlighting specific challenges in Persian language processing.
- Core Problem: The effectiveness of large language models on low-resource languages such as Persian requires in-depth investigation. While LLMs demonstrate excellent performance on high-resource languages like English, significant performance gaps remain for Persian and similar languages.
- Problem Significance:
- Persian possesses unique orthographic features, complex morphological structures, and grammatical patterns
- Compared to high-resource languages, Persian lacks comprehensive datasets, annotated corpora, and specialized NLP tools
- There is a need to provide equitable access to NLP technology for the Persian-speaking community
- Limitations of Existing Approaches:
- Lack of systematic LLM evaluation specifically for Persian
- Existing research primarily focuses on high-resource languages such as English
- Persian-specific linguistic phenomena remain insufficiently studied
- Research Motivation: To evaluate the capabilities of open-source LLMs on Persian tasks through zero-shot and few-shot learning paradigms, providing benchmarks for advancing NLP technology development in low-resource languages.
- Established the first comprehensive Persian LLM benchmark: Systematic evaluation of 11 open-source models across 50+ tasks
- Provided comparative analysis of zero-shot and few-shot learning paradigms: Revealed the impact of different learning paradigms on Persian tasks
- Identified specific challenges in Persian language processing: Particularly difficulties in token-level understanding tasks such as NER
- Provided baselines for future model development: Established important performance baselines and identified key areas requiring improvement
The research covers multiple core NLP tasks:
- Text Classification: Sentiment analysis, emotion detection
- Sequence Labeling: Named entity recognition
- Reading Comprehension: Context-based question answering
- Text Generation: Machine translation, text summarization
- Reasoning Tasks: Logical reasoning, commonsense reasoning, mathematical reasoning
The study evaluates 11 representative open-source LLMs:
- Gemma2: Google's efficient transformer model with enhanced multilingual representation capabilities
- GLM4: Generative language model optimized for complex reasoning and understanding tasks
- LLaMA3.1/3.2: Meta AI's refined architecture with improved token representation for non-Latin scripts
- Qwen2/2.5: Alibaba's multilingual foundation models
- Mistral: Computationally efficient model employing grouped-query attention mechanisms
- Other Models: Marco-O1, Aya-Expanse, Falcon3, Tulu3
- Unified Evaluation Framework: Established standardized prompt templates and evaluation pipelines
- Multi-Paradigm Comparison: Systematically compared the effectiveness of zero-shot and few-shot learning
- Fine-Grained Analysis: Error analysis targeting Persian-specific linguistic phenomena
- Cross-Domain Evaluation: Covered multiple knowledge domains including humanities and STEM
- ParsiNLU:
- Reading Comprehension: 1,000 paragraph-question pairs
- Textual Entailment: 2,500 premise-hypothesis pairs
- Sentiment Classification: 12,000 sentences
- Machine Translation: 10,000 English-Persian parallel sentence pairs
- ArmanEmo: 7,500 Persian social media posts annotated with 8 emotion categories
- ArmanNER: 7,682 sentences containing Person, Location, and Organization entity types
- Persian MMLU: 1,200 multiple-choice questions covering logic, theology, sociology, mathematics, and natural sciences
- Persian News Summary: 95,000 article-summary pairs
- Classification Tasks: Accuracy and macro-averaged F1 score
- Named Entity Recognition: Token-level F1 score
- Reading Comprehension: Exact Match (EM) and token overlap F1 score
- Machine Translation: BLEU score
- Text Summarization: ROUGE-1, ROUGE-2, ROUGE-L scores
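The classification and reading-comprehension metrics listed above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's evaluation code: the function names are hypothetical, and whitespace tokenization is an assumption (a real Persian pipeline would likely normalize characters and handle zero-width non-joiners before comparing tokens).

```python
from collections import Counter

def accuracy(gold, pred):
    """Fraction of predictions that match the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def macro_f1(gold, pred):
    """Unweighted mean of per-class F1, so rare classes count as much as frequent ones."""
    labels = sorted(set(gold) | set(pred))
    per_class = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        per_class.append(f1)
    return sum(per_class) / len(per_class)

def exact_match(gold_answer, pred_answer):
    """1.0 if the answers are identical after whitespace tokenization, else 0.0."""
    return float(gold_answer.split() == pred_answer.split())

def token_overlap_f1(gold_answer, pred_answer):
    """F1 over the multiset of tokens shared by the gold and predicted answers."""
    gold_tokens = gold_answer.split()
    pred_tokens = pred_answer.split()
    overlap = sum((Counter(gold_tokens) & Counter(pred_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Macro averaging matters for datasets with imbalanced label distributions, because a model that ignores a rare class is penalized even when its overall accuracy stays high.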
All 11 open-source LLMs were evaluated under the same experimental settings to ensure a fair comparison.
- Hardware: NVIDIA A100 GPUs (40GB VRAM)
- Software: Hugging Face Transformers (v4.30.2), PyTorch (v2.0.1)
- Inference Parameters: Temperature set to 0.1 for generation tasks, greedy decoding for classification tasks
- Few-Shot Setting: Randomly selected 5 representative examples per task
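The zero-shot and few-shot settings differ only in whether labeled demonstrations precede the query. A minimal sketch of that difference follows, assuming a simple "Text / Label" template; the template, the `build_prompt` helper, and the `DECODING` grouping are hypothetical, since this summary does not reproduce the paper's actual prompts. The decoding keys mirror Hugging Face `generate` keyword arguments.

```python
import random

# Hypothetical decoding settings mirroring the reported setup (assumption:
# low-temperature sampling for generation, greedy decoding for classification).
DECODING = {
    "classification": {"do_sample": False},
    "generation": {"do_sample": True, "temperature": 0.1},
}

def build_prompt(instruction, query, pool=(), k=0, seed=0):
    """Build a zero-shot (k=0) or few-shot (k>0) prompt.

    `pool` is a sequence of (text, label) pairs from which k demonstrations
    are sampled without replacement, using a fixed seed for reproducibility.
    """
    rng = random.Random(seed)
    shots = rng.sample(list(pool), k) if k else []
    parts = [instruction]
    for text, label in shots:
        parts.append(f"Text: {text}\nLabel: {label}")
    # The query comes last with an empty label slot for the model to complete.
    parts.append(f"Text: {query}\nLabel:")
    return "\n\n".join(parts)
```

Calling `build_prompt(instruction, query)` yields the zero-shot prompt, while `build_prompt(instruction, query, pool, k=5, seed=1)` prepends five sampled demonstrations. Fixing the seed keeps the demonstration set identical across models, which is one way to keep the comparison fair.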
Overall Performance Ranking:
- Gemma2: Few-shot 0.61, Zero-shot 0.42 (Best)
- GLM4: Few-shot 0.53, Zero-shot 0.35
- Qwen2.5: Few-shot 0.50, Zero-shot 0.35
- Other Models: Ranked below these three, with progressively lower scores
Key Findings:
- Gemma2 leads in both learning paradigms, with an average advantage exceeding 8%
- Few-shot learning universally outperforms zero-shot learning, with an average improvement of 13.8%
- Complex reasoning tasks benefit the most (17.3% improvement)
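The few-shot gain for the three models whose overall scores are reported can be worked out directly. The sketch below computes both the absolute difference and the gain relative to the zero-shot score, because this summary does not say which of the two the percentage figures refer to; the `gains` helper is hypothetical. The 13.8% average covers all 11 models, so it cannot be reproduced from these three alone.

```python
# Overall (zero-shot, few-shot) scores for the three models listed in the ranking.
SCORES = {
    "Gemma2": (0.42, 0.61),
    "GLM4": (0.35, 0.53),
    "Qwen2.5": (0.35, 0.50),
}

def gains(zero_shot, few_shot):
    """Return the absolute gain and the gain relative to the zero-shot score."""
    absolute = few_shot - zero_shot
    relative = absolute / zero_shot
    return absolute, relative

for model, (zs, fs) in SCORES.items():
    absolute, relative = gains(zs, fs)
    print(f"{model}: +{absolute:.2f} absolute, {relative:.1%} relative")
```

For these three models the absolute gains are 0.19, 0.18, and 0.15, which correspond to relative gains of roughly 45%, 51%, and 43% over the zero-shot scores.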
Advantageous Tasks:
- Logical Reasoning and Theology: Average scores of 0.412 and 0.395
- Reading Comprehension: 17.3% improvement in few-shot compared to zero-shot
- Textual Entailment: 15-20% improvement in few-shot setting
Challenging Tasks:
- Named Entity Recognition: Poor performance across all models, with only a 7.2% improvement in the few-shot setting
- Mathematics and Computer Science: Average scores of 0.287 and 0.301
- Token-Level Prediction: Structural limitations restrict performance
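To make the token-level difficulty concrete, here is one common way to compute a token-level F1 score for NER. It assumes BIO-style tags and micro-averaging over non-`O` tokens; this summary does not state which exact variant the paper uses, and the `token_level_f1` helper is hypothetical.

```python
def token_level_f1(gold_tags, pred_tags, outside="O"):
    """Micro-averaged F1 over entity tokens; tokens tagged `outside` in both lists are ignored."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_tags, pred_tags):
        if pred != outside and pred == gold:
            tp += 1                      # correct entity tag
        else:
            if pred != outside:
                fp += 1                  # predicted an entity tag that is wrong
            if gold != outside:
                fn += 1                  # missed or mislabeled a gold entity token
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A two-token person name where the model recovers only the first token,
# plus one spurious organization tag on an ordinary word.
gold = ["B-PER", "I-PER", "O", "B-LOC", "O"]
pred = ["B-PER", "O", "O", "B-LOC", "B-ORG"]
score = token_level_f1(gold, pred)
```

In this example two of the three gold entity tokens are recovered and one spurious tag is produced, so precision and recall are both 2/3. Every token of a multi-token entity must be tagged correctly to avoid a penalty, which is why partially recognized names lower the score.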
Domain Knowledge Differences:
- Humanities average 0.395 vs. STEM fields 0.287
- Indicates uneven distribution of multilingual training data
Linguistic Phenomenon Analysis:
- Semantic disambiguation error rate 23.7% higher
- Complex sentiment expression misclassification rate 31.2% higher
- Multi-token entity error rate 27.8% higher
- Idiomatic expression error rate 34.5% higher
Success Cases: Gemma2 excels in logical reasoning tasks, capable of handling complex semantic relationships
Failure Cases: All models struggle with Persian-specific idioms and cultural context understanding
- Development of benchmarks such as GLUE and MMLU
- Cross-lingual transfer learning research
- Few-shot learning applications in multilingual environments
- Dataset construction including ParsiNLU, ArmanEmo, and ArmanNER
- FaMTEB large-scale text embedding benchmark
- Persian-specific models such as PersianMind and Maral
- Cross-lingual knowledge transfer methods
- Prompt engineering techniques
- Low-resource language adaptation strategies
- Model Performance Hierarchy: Gemma2 significantly outperforms other models, demonstrating architectural advantages
- Learning Paradigm Impact: Few-shot learning brings significant improvements, particularly on semantic reasoning tasks
- Task-Specific Challenges: Token-level tasks such as NER pose challenges for all models
- Cross-Lingual Performance Gap: Persian performance averages 18.7% lower than on English benchmarks
- Model Selection: Does not cover all available models, particularly Persian-specific models
- Prompt Engineering: Prompt optimization was limited
- Dataset Representativeness: May not fully cover Persian dialectal variations
- Hyperparameter Optimization: Lacks task-specific hyperparameter tuning
- Example Quantity: Few-shot prompts used only a small number of examples (3-5)
- Model Diversification: Evaluate more Persian-specific LLMs
- Task Extension: Include complex tasks such as abstractive summarization and multi-turn dialogue
- Advanced Prompting Techniques: Explore dynamic prompt tuning and chain-of-thought reasoning
- Domain Adaptation: Develop benchmarks for specialized domains such as medicine and law
- Fine-Tuning Strategies: Investigate parameter-efficient fine-tuning methods
- Community Infrastructure: Establish community benchmark leaderboards
- Significant Research Value: Fills the gap in Persian LLM evaluation, providing important reference for low-resource language research
- Rigorous Experimental Design: Unified evaluation framework ensures fair comparison, covering multiple tasks and metrics
- Comprehensive Analysis: Provides not only performance data but also detailed error analysis and linguistic insights
- High Practical Value: Offers practical guidance for Persian NLP applications
- Limited Model Coverage: Lacks evaluation of some important Persian-specific models
- Insufficient Prompt Engineering: Standardized prompts may not fully leverage certain models' potential
- Limited Cultural Context Analysis: Analysis of Persian-specific cultural phenomena could be deeper
- Incomplete Computational Cost Description: Lacks detailed comparison of computational costs across models
- Academic Contribution: Provides important benchmark for multilingual LLM research, advancing low-resource language technology development
- Practical Value: Guides model selection and optimization for Persian NLP applications
- Reproducibility: Detailed experimental setup and open-source commitment support research reproduction
- Community Building: Promotes development of the Persian NLP research community
- Model Selection: Guides selection of appropriate foundation models for Persian NLP applications
- Benchmark Comparison: Serves as performance baseline for new model development
- Research Guidance: Provides direction for Persian-specific model improvements
- Educational Resource: Serves as teaching material for multilingual NLP courses
The paper cites 32 relevant references covering:
- LLM evaluation methodology research
- Multilingual capability evaluation frameworks
- Persian NLP resources and challenges
- Zero-shot and few-shot learning techniques
Key references include the ParsiNLU benchmark suite, the ArmanEmo emotion dataset, and important works on multilingual LLM capability assessment.
Summary: This is a high-quality empirical research paper that establishes an important benchmark for Persian LLM evaluation. The research methodology is rigorous, results are convincing, and it holds significant importance for advancing NLP technology development in low-resource languages. Despite some limitations, its contributions and impact are substantial.