Benchmarking Open-Source Large Language Models for Persian in Zero-Shot and Few-Shot Learning

Cherakhloo, Abbasi, Sarafraz et al.

Large Language Models (LLMs) have demonstrated remarkable capabilities across numerous languages; however, their effectiveness in low-resource languages like Persian requires thorough investigation. This paper presents a comprehensive benchmark of several open-source LLMs for Persian Natural Language Processing (NLP) tasks, utilizing both zero-shot and few-shot learning paradigms. We evaluate models across a range of tasks including sentiment analysis, named entity recognition, reading comprehension, and question answering, using established Persian datasets such as ParsiNLU and ArmanEmo. Our methodology encompasses rigorous experimental setups for both zero-shot and few-shot scenarios, employing metrics such as Accuracy, F1-score, BLEU, and ROUGE for performance evaluation. The results reveal that Gemma 2 consistently outperforms other models across nearly all tasks in both learning paradigms, with particularly strong performance in complex reasoning tasks. However, most models struggle with token-level understanding tasks like Named Entity Recognition, highlighting specific challenges in Persian language processing. This study contributes to the growing body of research on multilingual LLMs, providing valuable insights into their performance in Persian and offering a benchmark for future model development.

Basic Information

Paper ID: 2510.12807
Title: Benchmarking Open-Source Large Language Models for Persian in Zero-Shot and Few-Shot Learning
Authors: Mahdi Cherakhloo, Arash Abbasi, Mohammad Saeid Sarafraz, Bijan Vosoughi Vahdat
Classification: cs.CL cs.AI
Publication Date: October 16, 2025
Paper Link: https://arxiv.org/abs/2510.12807

Abstract

This study presents a comprehensive benchmark evaluation of multiple open-source large language models on Persian natural language processing tasks using zero-shot and few-shot learning paradigms. The research covers sentiment analysis, named entity recognition, reading comprehension, and question-answering tasks, utilizing established Persian datasets such as ParsiNLU and ArmanEmo. The experiments employ rigorous zero-shot and few-shot settings with performance evaluation using metrics including accuracy, F1 score, BLEU, and ROUGE. Results demonstrate that Gemma 2 achieves superior performance across nearly all tasks in both learning paradigms, particularly excelling in complex reasoning tasks. However, most models perform poorly on token-level understanding tasks such as named entity recognition, highlighting specific challenges in Persian language processing.

Research Background and Motivation

Core Problem: The effectiveness of large language models on low-resource languages such as Persian requires in-depth investigation. While LLMs demonstrate excellent performance on high-resource languages like English, significant performance gaps remain for Persian and similar languages.
Problem Significance:
- Persian possesses unique orthographic features, complex morphological structures, and grammatical patterns
- Compared to high-resource languages, Persian lacks comprehensive datasets, annotated corpora, and specialized NLP tools
- There is a need to provide equitable access to NLP technology for the Persian-speaking community
Limitations of Existing Approaches:
- Lack of systematic LLM evaluation specifically for Persian
- Existing research primarily focuses on high-resource languages such as English
- Persian-specific linguistic phenomena remain insufficiently studied
Research Motivation: To evaluate the capabilities of open-source LLMs on Persian tasks through zero-shot and few-shot learning paradigms, providing benchmarks for advancing NLP technology development in low-resource languages.

Core Contributions

Established the first comprehensive Persian LLM benchmark: Systematic evaluation of 11 open-source models across 50+ tasks
Provided comparative analysis of zero-shot and few-shot learning paradigms: Revealed the impact of different learning paradigms on Persian tasks
Identified specific challenges in Persian language processing: Particularly difficulties in token-level understanding tasks such as NER
Provided baselines for future model development: Established important performance baselines and identified key areas requiring improvement

Methodology Details

Task Definition

The research covers multiple core NLP tasks:

Text Classification: Sentiment analysis, emotion detection
Sequence Labeling: Named entity recognition
Reading Comprehension: Context-based question answering
Text Generation: Machine translation, text summarization
Reasoning Tasks: Logical reasoning, commonsense reasoning, mathematical reasoning

Model Architecture

The study evaluates 11 representative open-source LLMs:

Gemma2: Google's efficient transformer model with enhanced multilingual representation capabilities
GLM4: Generative language model optimized for complex reasoning and understanding tasks
LLaMA3.1/3.2: Meta AI's refined architecture with improved token representation for non-Latin scripts
Qwen2/2.5: Alibaba's multilingual foundation models
Mistral: Computationally efficient model employing grouped-query attention mechanisms
Other Models: Marco-O1, Aya-Expanse, Falcon3, Tulu3

Technical Innovations

Unified Evaluation Framework: Established standardized prompt templates and evaluation pipelines
Multi-Paradigm Comparison: Systematically compared the effectiveness of zero-shot and few-shot learning
Fine-Grained Analysis: Error analysis targeting Persian-specific linguistic phenomena
Cross-Domain Evaluation: Covered multiple knowledge domains including humanities and STEM
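
A unified evaluation framework of the kind described above can be sketched as a small harness that pairs one prompt template per task with a pluggable model and metric. This is a minimal illustration, not the authors' pipeline: the template wording, the `PROMPT_TEMPLATES` and `evaluate` names, and the `model_fn` callable (any function mapping a prompt string to generated text) are all assumptions.

```python
from typing import Callable, Iterable, Mapping, Tuple

# Hypothetical templates; the paper's actual prompts are not reproduced here.
PROMPT_TEMPLATES = {
    "sentiment": "Classify the sentiment of the Persian text.\nText: {text}\nLabel:",
    "reading_comprehension": (
        "Answer the question using the passage.\n"
        "Passage: {passage}\nQuestion: {question}\nAnswer:"
    ),
}


def evaluate(
    model_fn: Callable[[str], str],
    task: str,
    examples: Iterable[Tuple[Mapping[str, str], str]],
    metric_fn: Callable[[str, str], float],
) -> float:
    """Average metric_fn(prediction, reference) over (fields, reference) pairs."""
    template = PROMPT_TEMPLATES[task]
    scores = []
    for fields, reference in examples:
        prompt = template.format(**fields)      # same template for every model
        prediction = model_fn(prompt).strip()   # model_fn wraps any LLM backend
        scores.append(metric_fn(prediction, reference))
    return sum(scores) / len(scores) if scores else 0.0
```

Because every model is called through the same `model_fn` interface with the same templates, differences in the averaged score come from the model rather than from the prompt.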

Experimental Setup

Datasets

ParsiNLU:
- Reading Comprehension: 1,000 paragraph-question pairs
- Textual Entailment: 2,500 premise-hypothesis pairs
- Sentiment Classification: 12,000 sentences
- Machine Translation: 10,000 English-Persian parallel sentence pairs
ArmanEmo: 7,500 Persian social media posts annotated with 8 emotion categories
ArmanNER: 7,682 sentences containing Person, Location, and Organization entity types
Persian MMLU: 1,200 multiple-choice questions covering logic, theology, sociology, mathematics, and natural sciences
Persian News Summary: 95,000 article-summary pairs

Evaluation Metrics

Classification Tasks: Accuracy and macro-averaged F1 score
Named Entity Recognition: Token-level F1 score
Reading Comprehension: Exact Match (EM) and token overlap F1 score
Machine Translation: BLEU score
Text Summarization: ROUGE-1, ROUGE-2, ROUGE-L scores
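
The metrics listed above for reading comprehension and classification can be sketched in a few lines. This is an illustrative version only: it assumes whitespace tokenization and applies no Persian-specific normalization (for example of zero-width non-joiners or Arabic versus Persian letter variants), which a real evaluation would likely need. The function names are hypothetical.

```python
from collections import Counter


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the whitespace-tokenized strings are identical, else 0.0."""
    return float(prediction.split() == reference.split())


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted answer and a reference answer."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold: list[str], predicted: list[str]) -> float:
    """Unweighted mean of per-class F1 over the classes present in gold."""
    labels = sorted(set(gold))
    if not labels:
        return 0.0
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, predicted))
        fp = sum(g != label and p == label for g, p in zip(gold, predicted))
        fn = sum(g == label and p != label for g, p in zip(gold, predicted))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Macro averaging gives every class equal weight, so a model that ignores a rare emotion or sentiment label is penalized even when its accuracy looks high.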

Comparison Methods

All 11 open-source LLMs are compared under the same experimental settings to keep the comparison fair.

Implementation Details

Hardware: NVIDIA A100 GPUs (40GB VRAM)
Software: Hugging Face Transformers (v4.30.2), PyTorch (v2.0.1)
Inference Parameters: Temperature set to 0.1 for generation tasks, greedy decoding for classification tasks
Few-Shot Setting: Randomly selected 5 representative examples per task
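
The few-shot setting and decoding parameters above can be illustrated as follows. This is a sketch under stated assumptions: the `Input:`/`Output:` prompt layout, the fixed seed, and the `build_few_shot_prompt` name are hypothetical. The two settings dictionaries show how the stated parameters might be expressed as keyword arguments to Hugging Face `generate` (`do_sample` and `temperature` are existing options there), but the paper's exact configuration is not given in this summary.

```python
import random

# Assumed mapping of the stated inference parameters onto generate() kwargs.
GENERATION_SETTINGS = {"do_sample": True, "temperature": 0.1}
CLASSIFICATION_SETTINGS = {"do_sample": False}  # greedy decoding


def build_few_shot_prompt(instruction, pool, query, k=5, seed=0):
    """Prepend k randomly chosen (text, label) demonstrations to the query.

    pool is a list of (text, label) pairs; k=0 yields a zero-shot prompt.
    """
    rng = random.Random(seed)  # fixed seed so every model sees the same shots
    shots = rng.sample(pool, k=min(k, len(pool)))
    parts = [instruction]
    for text, label in shots:
        parts.append(f"Input: {text}\nOutput: {label}")
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)
```

Calling the same function with `k=0` produces the zero-shot prompt, so the two paradigms differ only in the demonstrations shown.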

Experimental Results

Main Results

Overall Performance Ranking:

Gemma2: Few-shot 0.61, Zero-shot 0.42 (Best)
GLM4: Few-shot 0.53, Zero-shot 0.35
Qwen2.5: Few-shot 0.50, Zero-shot 0.35
Other Models: The remaining models score progressively lower
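
The few-shot gain implied by the three rows above can be worked out directly. This summary does not say whether its percentage improvements are absolute (score points) or relative (fraction of the zero-shot score), so the sketch below computes both for the listed models only. The `gains` helper is a hypothetical name.

```python
# (zero-shot, few-shot) overall scores as listed in the ranking above.
overall = {
    "Gemma2": (0.42, 0.61),
    "GLM4": (0.35, 0.53),
    "Qwen2.5": (0.35, 0.50),
}


def gains(zero_shot: float, few_shot: float) -> tuple[float, float]:
    """Return (absolute gain in score points, relative gain in percent)."""
    absolute = few_shot - zero_shot
    relative = absolute / zero_shot * 100
    return round(absolute, 2), round(relative, 1)


for model, (zero_shot, few_shot) in overall.items():
    absolute, relative = gains(zero_shot, few_shot)
    print(f"{model}: +{absolute:.2f} points, +{relative:.1f}% relative")
```

For these three models the absolute gain is 0.15 to 0.19 score points, which corresponds to a relative gain of roughly 43% to 51% over the zero-shot score.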

Key Findings:

Gemma2 leads in both learning paradigms, with an average advantage exceeding 8%
Few-shot learning universally outperforms zero-shot learning, with an average improvement of 13.8%
Complex reasoning tasks benefit most significantly (17.3% improvement)

Task-Specific Analysis

Advantageous Tasks:

Logical Reasoning and Theology: Average scores of 0.412 and 0.395
Reading Comprehension: 17.3% improvement in few-shot compared to zero-shot
Textual Entailment: 15-20% improvement in few-shot setting

Challenging Tasks:

Named Entity Recognition: Poor performance across all models with only 7.2% improvement in few-shot
Mathematics and Computer Science: Average scores of 0.287 and 0.301
Token-Level Prediction: Structural limitations restrict performance
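
Token-level scoring requires exactly one tag per input token, which free-form generated text does not guarantee. The sketch below shows one way such scoring could work. It is an assumption rather than the paper's procedure: it uses plain entity-type tags instead of a BIO scheme, and it pads or truncates the model's tag sequence with the outside tag `O` when its length does not match the gold sequence. Both function names are hypothetical.

```python
def align_tags(predicted, n_tokens, outside="O"):
    """Pad with the outside tag or truncate so there is one tag per token."""
    return (list(predicted) + [outside] * n_tokens)[:n_tokens]


def token_level_f1(gold, predicted, outside="O"):
    """F1 over entity-tagged tokens; the outside tag is not counted as a hit."""
    predicted = align_tags(predicted, len(gold), outside)
    true_positives = sum(g == p and g != outside for g, p in zip(gold, predicted))
    if true_positives == 0:
        return 0.0
    precision = true_positives / sum(p != outside for p in predicted)
    recall = true_positives / sum(g != outside for g in gold)
    return 2 * precision * recall / (precision + recall)


# A two-token person name where the model tags only the first token.
gold = ["PER", "PER", "O", "LOC"]
model_output = ["PER", "O", "O"]  # one tag short, so it is padded with "O"
print(token_level_f1(gold, model_output))
```

Under this scoring, a model that finds only part of a multi-token entity, or emits too few tags, loses recall on every token it misses.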

Ablation Studies

Domain Knowledge Differences:

Humanities average 0.395 vs. STEM fields 0.287
Indicates uneven distribution of multilingual training data

Linguistic Phenomenon Analysis:

Semantic disambiguation error rate 23.7% higher
Complex sentiment expression misclassification rate 31.2% higher
Multi-token entity error rate 27.8% higher
Idiomatic expression error rate 34.5% higher

Case Studies

Success Cases: Gemma2 excels in logical reasoning tasks, capable of handling complex semantic relationships

Failure Cases: All models struggle with Persian-specific idioms and cultural context understanding

Related Work

Multilingual LLM Evaluation

Development of benchmarks such as GLUE and MMLU
Cross-lingual transfer learning research
Few-shot learning applications in multilingual environments

Persian NLP Resources

Dataset construction including ParsiNLU, ArmanEmo, and ArmanNER
FaMTEB large-scale text embedding benchmark
Persian-specific models such as PersianMind and Maral

Zero-Shot and Few-Shot Learning

Cross-lingual knowledge transfer methods
Prompt engineering techniques
Low-resource language adaptation strategies

Conclusions and Discussion

Main Conclusions

Model Performance Hierarchy: Gemma2 significantly outperforms other models, demonstrating architectural advantages
Learning Paradigm Impact: Few-shot learning brings significant improvements, particularly on semantic reasoning tasks
Task-Specific Challenges: Token-level tasks such as NER pose challenges for all models
Cross-Lingual Performance Gap: Persian performance averages 18.7% lower compared to English benchmarks

Limitations

Model Selection: Does not cover all available models, particularly Persian-specific models
Prompt Engineering: Prompts were not extensively optimized
Dataset Representativeness: May not fully cover Persian dialectal variations
Hyperparameter Optimization: Lacks task-specific hyperparameter tuning
Example Quantity: Only a small number of few-shot examples (3-5) was used

Future Directions

Model Diversification: Evaluate more Persian-specific LLMs
Task Extension: Include complex tasks such as abstractive summarization and multi-turn dialogue
Advanced Prompting Techniques: Explore dynamic prompt tuning and chain-of-thought reasoning
Domain Adaptation: Develop benchmarks for specialized domains such as medicine and law
Fine-Tuning Strategies: Investigate parameter-efficient fine-tuning methods
Community Infrastructure: Establish community benchmark leaderboards

In-Depth Evaluation

Strengths

Significant Research Value: Fills the gap in Persian LLM evaluation, providing important reference for low-resource language research
Rigorous Experimental Design: Unified evaluation framework ensures fair comparison, covering multiple tasks and metrics
Comprehensive Analysis: Provides not only performance data but also detailed error analysis and linguistic insights
High Practical Value: Offers practical guidance for Persian NLP applications

Weaknesses

Limited Model Coverage: Lacks evaluation of some important Persian-specific models
Insufficient Prompt Engineering: Standardized prompts may not fully leverage certain models' potential
Limited Cultural Context Analysis: Analysis of Persian-specific cultural phenomena could be deeper
Incomplete Computational Cost Description: Lacks detailed comparison of computational costs across models

Impact

Academic Contribution: Provides important benchmark for multilingual LLM research, advancing low-resource language technology development
Practical Value: Guides model selection and optimization for Persian NLP applications
Reproducibility: Detailed experimental setup and open-source commitment support research reproduction
Community Building: Promotes development of the Persian NLP research community

Applicable Scenarios

Model Selection: Guides selection of appropriate foundation models for Persian NLP applications
Benchmark Comparison: Serves as performance baseline for new model development
Research Guidance: Provides direction for Persian-specific model improvements
Educational Resource: Serves as teaching material for multilingual NLP courses

References

The paper cites 32 relevant references covering:

LLM evaluation methodology research
Multilingual capability evaluation frameworks
Persian NLP resources and challenges
Zero-shot and few-shot learning techniques

Key references include the ParsiNLU benchmark suite, the ArmanEmo emotion dataset, and important works on multilingual LLM capability assessment.

Summary: This is a high-quality empirical research paper that establishes an important benchmark for Persian LLM evaluation. The research methodology is rigorous, results are convincing, and it holds significant importance for advancing NLP technology development in low-resource languages. Despite some limitations, its contributions and impact are substantial.