Existing Persian speech datasets are typically smaller than their English counterparts, which creates a key limitation for developing Persian speech technologies. We address this gap by introducing ParsVoice, the largest Persian speech corpus designed specifically for text-to-speech (TTS) applications. We created an automated pipeline that transforms raw audiobook content into TTS-ready data, incorporating components such as a BERT-based sentence completion detector, a binary search boundary optimization method for precise audio-text alignment, and audio-text quality assessment frameworks tailored to Persian. The pipeline processed 2,000 audiobooks, yielding 3,526 hours of clean speech, which was further filtered into a 1,804-hour high-quality subset suitable for TTS, featuring more than 470 speakers. To validate the dataset, we fine-tuned XTTS for Persian, achieving a naturalness Mean Opinion Score (MOS) of 3.6/5 and a Speaker Similarity Mean Opinion Score (SMOS) of 4.0/5, demonstrating ParsVoice's effectiveness for training multi-speaker TTS systems. ParsVoice is the largest high-quality Persian speech dataset, offering speaker diversity and audio quality comparable to major English corpora. To accelerate the development of Persian speech technologies, the complete dataset is publicly available at: https://huggingface.co/datasets/MohammadJRanjbar/ParsVoice.
- Paper ID: 2510.10774
- Title: ParsVoice: A Large-Scale Multi-Speaker Persian Speech Corpus for Text-to-Speech Synthesis
- Authors: Mohammad Javad Ranjbar Kalahroodi, Heshaam Faili, Azadeh Shakery (University of Tehran)
- Categories: cs.SD (Sound), cs.AI (Artificial Intelligence), cs.HC (Human-Computer Interaction), cs.LG (Machine Learning)
- Publication Date: October 14, 2025 (arXiv v2)
- Paper Link: https://arxiv.org/abs/2510.10774
Existing Persian speech datasets are typically much smaller than their English counterparts, which critically limits the development of Persian speech technologies. This paper addresses the gap by introducing ParsVoice, the largest Persian speech corpus designed specifically for text-to-speech (TTS) applications. The research team developed an automated pipeline that converts raw audiobook content into TTS-ready data. It incorporates a BERT-based sentence completeness detector, a binary search boundary optimization method for precise audio-text alignment, and an audio-text quality assessment framework customized for Persian. The pipeline processed 2,000 audiobooks and yielded 3,526 hours of clean speech, which was further filtered to a 1,804-hour high-quality subset containing 470+ speakers. To validate the dataset, the team fine-tuned XTTS for Persian, achieving a naturalness Mean Opinion Score (MOS) of 3.6/5 and a speaker similarity MOS (SMOS) of 4.0/5.
- Data Scarcity Issue: Persian, spoken by over 100 million people globally, severely lacks representation in speech corpora compared to high-resource languages such as English.
- TTS-Specific Requirements: Text-to-speech systems have different data quality requirements than automatic speech recognition (ASR) systems. While ASR can benefit from noisy real-world data, TTS requires clean and precisely aligned audio-text pairs to generate natural speech.
- Limitations of Existing Datasets:
  - DeepMine+: 480+ hours, 1850+ speakers, but with commercial restrictions
  - DeepMine-Multi-TTS: 120 hours, 67 speakers
  - ArmanTTS: 9 hours, single speaker
  - ManaTTS: 86 hours, single speaker
Persian data scarcity extends beyond speech to text processing. Its effects cascade across several areas of Persian language processing, including speech-to-text alignment systems and optical character recognition (OCR) models, and severely hinder the development of Persian language technologies.
- Construction of the largest publicly available Persian TTS corpus: Containing 1,804 hours of high-quality speech data with 470+ distinct speakers, representing a 10-fold increase compared to existing Persian resources
- Development of a scalable automated data construction pipeline:
  - BERT-based sentence completeness detection
  - Binary search boundary optimization algorithm
  - Persian-specific quality assessment framework
- Implementation of phoneme-free Persian TTS: Achieving high-quality speech synthesis without explicit phoneme transcription through fine-tuning the XTTS model
- Provision of open-source dataset: Complete dataset publicly released to advance Persian speech technology development
The task is to convert raw audiobook audio into high-quality TTS training data:
- Input: Raw audiobook audio files and corresponding text
- Output: Segmented audio-text pairs with accurate temporal alignment and quality scores
- Constraints: Maintaining sentence integrity, ensuring audio quality, enabling speaker identification
- Data Source: IranSeda platform (book.iranseda.ir)
- Scale: 3,800+ audiobooks with multi-category coverage
- Quality: Professional narrators, controlled recording environment, 44.1 kHz sampling rate
- Copyright: Publicly accessible, no copyright restrictions
Sentence Completeness Detection Model:
- Binary classifier fine-tuned on ParsBERT
- Training data: Complete Persian sentences and synthetically generated incomplete sentences
- Performance: F1 score of 97.4%
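The summary does not say how the incomplete sentences were synthesized. As a purely illustrative assumption, one simple way to build such training pairs is to truncate complete sentences at a random word boundary. The function names `make_incomplete` and `build_pairs` below are hypothetical, and the resulting pairs would still need to be fed to a ParsBERT fine-tuning loop, which is not shown.

```python
import random

def make_incomplete(sentence: str, rng: random.Random) -> str:
    """Cut a sentence at a random word boundary (assumed synthesis strategy)."""
    # Drop sentence-final punctuation, including the Persian question mark.
    words = sentence.rstrip(".!?؟").split()
    if len(words) < 2:
        raise ValueError("need at least two words to truncate")
    cut = rng.randint(1, len(words) - 1)  # keep between 1 and len-1 words
    return " ".join(words[:cut])

def build_pairs(sentences, seed=0):
    """Return (text, label) pairs: label 1 = complete, label 0 = incomplete."""
    rng = random.Random(seed)
    data = []
    for s in sentences:
        data.append((s, 1))
        data.append((make_incomplete(s, rng), 0))
    return data

pairs = build_pairs(["من امروز به مدرسه رفتم."])
```

Each input sentence yields one positive and one negative example, so the classes are balanced by construction.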
Three-Stage Segmentation Process:
- Acoustic Boundary Detection: Using WebRTC voice activity detection (VAD)
- Transcription and Alignment: Google Speech-to-Text API transcription
- Linguistic Validation: BERT classifier detects sentence completeness, with boundary expansion in 0.1-second increments when necessary
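The linguistic validation step can be sketched as a small loop. This is a minimal sketch under stated assumptions, not the authors' implementation: `transcribe` and `is_complete` are hypothetical callables standing in for the speech-to-text call and the BERT classifier, only the end boundary is expanded, and the `max_extra` budget is invented for illustration.

```python
from typing import Callable, Tuple

def expand_until_complete(
    start: float,
    end: float,
    audio_len: float,
    transcribe: Callable[[float, float], str],  # hypothetical STT wrapper
    is_complete: Callable[[str], bool],         # hypothetical classifier wrapper
    step: float = 0.1,                          # 0.1 s increments, as in the summary
    max_extra: float = 5.0,                     # assumed expansion budget
) -> Tuple[float, float, str, bool]:
    """Grow the segment end in `step` increments until the transcript is judged complete."""
    text = transcribe(start, end)
    new_end = end
    steps = 0
    max_steps = round(max_extra / step)
    while not is_complete(text) and steps < max_steps and new_end < audio_len:
        steps += 1
        # Multiply instead of accumulating to limit floating-point drift.
        new_end = min(audio_len, end + steps * step)
        text = transcribe(start, new_end)
    return start, new_end, text, is_complete(text)
```

The final boolean lets the caller discard segments that never became complete within the budget.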
Two-Stage Search Strategy:
- Initial Adjustment: Removing 3 seconds from beginning and end
- Stability Verification: Checking transcription discrepancies
- Binary Search Optimization: Iteratively halving trimming intervals
- Fine-Grained Linear Search: 0.1-second increment precision alignment
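One way to read the steps above is as a search for the largest trim that leaves the transcript unchanged. The sketch below is an interpretation, not the paper's code. It assumes that stability is monotonic (once a trim changes the transcript, every larger trim does too), that trimming nothing reproduces the reference, and that `transcribe_trimmed` is a hypothetical callable returning the transcript after trimming `t` seconds from one end. The `coarse_steps` cutoff between the two phases is also an assumption.

```python
from typing import Callable

def max_safe_trim(
    transcribe_trimmed: Callable[[float], str],  # hypothetical STT wrapper
    reference: str,
    max_trim: float = 3.0,   # initial 3 s adjustment from the summary
    step: float = 0.1,       # final 0.1 s precision from the summary
    coarse_steps: int = 4,   # assumed width at which binary search hands over
) -> float:
    """Largest trim (a multiple of `step`) that keeps the transcript equal to `reference`."""
    def stable(k: int) -> bool:
        return transcribe_trimmed(k * step) == reference

    hi = round(max_trim / step)
    if stable(hi):
        return hi * step          # the full initial trim is already safe
    lo = 0                        # known-safe trim; hi is known-unsafe
    # Binary search: halve the interval between safe and unsafe trims.
    while hi - lo > coarse_steps:
        mid = (lo + hi) // 2
        if stable(mid):
            lo = mid
        else:
            hi = mid
    # Fine-grained linear search in single `step` increments.
    while lo + 1 < hi and stable(lo + 1):
        lo += 1
    return lo * step
```

Working in integer multiples of `step` keeps the loop bounds exact. The same function can be applied separately to the beginning and the end of a segment by passing a different callable.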
Persian Text Quality Framework:
- Character Quality: Proportion of valid Persian characters and numerals
- Length Quality: Sentence length appropriateness assessment
- Repetition Score: Vocabulary diversity rewards
- Phoneme Coverage: Range of Persian characters and phonemes
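Two of the text criteria are simple enough to sketch directly. The exact definitions and weights used in ParsVoice are not given here, so the following is an assumed formulation: character quality counts non-whitespace characters in the Unicode Arabic block (U+0600 to U+06FF, which contains Persian letters and digits) plus the zero-width non-joiner, and the repetition score is a plain type-token ratio.

```python
def char_quality(text: str) -> float:
    """Share of non-whitespace characters that fall in the assumed valid set."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    # Assumed valid set: Arabic block (covers Persian) and ZWNJ (U+200C).
    valid = sum(1 for c in chars if "\u0600" <= c <= "\u06FF" or c == "\u200c")
    return valid / len(chars)

def repetition_score(text: str) -> float:
    """Type-token ratio: 1.0 means no repeated words, lower means more repetition."""
    words = text.split()
    return len(set(words)) / len(words) if words else 0.0

print(char_quality("سلام دنیا"))
print(repetition_score("کتاب کتاب خوب"))
```

Under this definition, ASCII punctuation and Latin letters lower the character score, which may or may not match the original framework.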
Audio Quality Framework:
- Signal-to-noise ratio estimation
- Dynamic range analysis
- Spectral features and MFCC variance
- Clipping, silence, and background music detection
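For intuition, some of these audio checks can be approximated with a few lines of standard-library Python. This is a crude heuristic sketch, not the framework described in the paper: it assumes samples are floats in [-1, 1], treats the quietest frames as the noise floor and the loudest frames as speech, and uses thresholds chosen only for illustration.

```python
import math

def clipping_ratio(samples, threshold=0.999):
    """Fraction of samples at or beyond the assumed clipping threshold."""
    return sum(1 for s in samples if abs(s) >= threshold) / len(samples)

def frame_powers(samples, frame_len):
    """Mean power of each non-overlapping frame (a trailing partial frame is dropped)."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]

def silence_ratio(samples, frame_len=400, power_floor=1e-5):
    """Fraction of frames whose mean power is below an assumed floor."""
    powers = frame_powers(samples, frame_len)
    return sum(1 for p in powers if p < power_floor) / len(powers) if powers else 0.0

def estimate_snr_db(samples, frame_len=400, noise_fraction=0.1):
    """Rough SNR: loudest frames versus quietest frames, in decibels."""
    powers = sorted(frame_powers(samples, frame_len))
    k = max(1, int(len(powers) * noise_fraction))
    noise = sum(powers[:k]) / k    # quietest frames approximate the noise floor
    signal = sum(powers[-k:]) / k  # loudest frames approximate speech
    eps = 1e-12                    # avoids division by zero and log of zero
    return 10 * math.log10((signal + eps) / (noise + eps))
```

Spectral features, MFCC variance, and background music detection would need a signal-processing library and are left out of this sketch.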
Two-Stage Identification Process:
- Local Speaker Separation: Clustering based on ECAPA-TDNN embeddings
- Global Speaker Identification: Cross-book speaker unification
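The local and global stages can be illustrated with a greedy, threshold-based clustering over speaker embeddings. This is a simplified sketch, assuming embeddings are already extracted (for example by an ECAPA-TDNN model, not shown) and that one cosine threshold of 0.8 is used at both stages. The clustering algorithm, the threshold value, and the function names are illustrative assumptions; the summary does not specify them.

```python
import math

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def greedy_cluster(vectors, threshold):
    """Join each vector to its most similar centroid above `threshold`, else open a new cluster."""
    centroids, counts, labels = [], [], []
    for v in vectors:
        best, best_sim = None, threshold
        for i, c in enumerate(centroids):
            sim = cosine(v, c)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best is None:
            centroids.append(list(v))
            counts.append(1)
            labels.append(len(centroids) - 1)
        else:
            n = counts[best]
            # Update the running mean of the matched cluster.
            centroids[best] = [(c * n + x) / (n + 1) for c, x in zip(centroids[best], v)]
            counts[best] = n + 1
            labels.append(best)
    return labels, centroids

def identify_speakers(books, threshold=0.8):
    """Map (book_id, segment_index) to a global speaker id.

    `books` maps a book id to its list of segment embeddings.
    """
    local, all_centroids, owners = {}, [], []
    for book_id, vecs in books.items():          # stage 1: cluster within each book
        labels, cents = greedy_cluster(vecs, threshold)
        local[book_id] = labels
        for j, c in enumerate(cents):
            all_centroids.append(c)
            owners.append((book_id, j))
    global_labels, _ = greedy_cluster(all_centroids, threshold)  # stage 2: unify across books
    local_to_global = dict(zip(owners, global_labels))
    return {
        (b, i): local_to_global[(b, l)]
        for b, labels in local.items()
        for i, l in enumerate(labels)
    }
```

Clustering per book first keeps the cross-book stage small, since it only compares one centroid per local speaker rather than every segment.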
- Sentence-Aware Segmentation: Combining acoustic boundary detection with linguistic completeness verification
- Adaptive Boundary Optimization: Efficient algorithm combining binary search with linear fine-tuning
- Persian-Specific Quality Assessment: Multi-dimensional quality assessment framework tailored to Persian characteristics
- Scalable Processing Pipeline: Automated pipeline capable of processing thousands of hours of audio content
- Raw Data: 3,807 books (9,538 hours), with 2,000 books actually processed
- Initial Segmentation: 5,158,344 audio segments
- After Filtering: 3,321,212 valid segments
- Final Dataset:
  - Total: 3,526 hours, 470+ speakers
  - TTS Subset: 1,804 hours of high-quality data
- Subjective Evaluation:
  - Naturalness MOS (1-5 scale)
  - Speaker similarity SMOS (1-5 scale)
  - Text accuracy score
- Objective Evaluation:
  - Word error rate (WER) and character error rate (CER)
  - ECAPA-TDNN embedding cosine similarity
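WER and CER are both edit-distance ratios: the number of substitutions, insertions, and deletions needed to turn the recognized text into the reference, divided by the reference length in words or characters. The sketch below uses the standard Levenshtein recurrence. Stripping spaces before computing CER is an assumption here, since conventions differ and the summary does not state which one was used.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists of words or strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1]

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate relative to the number of reference words."""
    ref = reference.split()
    return edit_distance(ref, hypothesis.split()) / len(ref)

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, ignoring spaces (assumed convention)."""
    ref = reference.replace(" ", "")
    return edit_distance(ref, hypothesis.replace(" ", "")) / len(ref)

print(wer("the cat sat on the mat", "the cat sit on mat"))
print(cer("the cat sat on the mat", "the cat sit on mat"))
```

In a TTS evaluation, the reference is the input text and the hypothesis is the transcript that a speech recognizer produces from the synthesized audio, so recognizer errors also inflate both numbers.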
- FastSpeech2 End-to-End
- FastSpeech2 Cascaded
- Other Persian TTS systems (ManaTTS, DeepMine-Multi-TTS, etc.)
- Model: XTTS multilingual TTS model
- Training: BPE model training with 2,500 new Persian tokens
- Fine-tuning: Batch size 16, 170,000 steps
- Evaluation: 90 synthesized samples, 40 evaluators
| System | MOS | SMOS |
|---|---|---|
| XTTS + ParsVoice (This Work) | 3.60 | 4.00 |
| FastSpeech2 End-to-End | 3.72 | 4.02 |
| FastSpeech2 Cascaded | 3.34 | 3.81 |
- WER: 22.57%
- CER: 12.78%
- Speaker Similarity: 80% (based on ECAPA-TDNN embeddings)
- Text Accuracy: 4.0/5 (human evaluation)
- Boundary Optimization Effect: Removed 442.73 hours (11.2%) of unnecessary silence and noise
- Segmentation Statistics: 81.0% of segments required beginning trimming, 50.4% required ending trimming
- Average Segment Duration: 5.49 seconds (optimal for TTS training)
- Linguistic Diversity: 267,965 unique words, 25,499,474 tokens
- Detected Speakers: 1,815 unique speaker instances
- Gender Distribution: Approximately 33% female, 67% male
- Consistency: 97.0% consistency with known narrator labels
- LibriSpeech: Large-scale ASR corpus
- LJSpeech: Single-speaker TTS dataset
- VCTK: Multi-speaker English corpus
- Common Voice: 20+ languages, but insufficient Persian quality
- Multilingual LibriSpeech: Biased toward European languages
- VoxPopuli: Variable quality across language communities
- Traditional methods require explicit phoneme representation
- Existing datasets are small-scale and predominantly single-speaker
- Commercial restrictions hinder research development
- Successfully constructed the largest publicly available Persian TTS corpus containing 1,804 hours of high-quality speech data
- Developed a fully automated and scalable dataset construction pipeline applicable to other low-resource languages
- Validated the dataset's effectiveness by achieving competitive performance on Persian TTS tasks
- Automatic evaluation metrics may underestimate quality, because commercial STT systems offer limited support for Persian synthesized speech
- Imbalanced speaker distribution: Higher proportion of male speakers (67% vs 33%)
- Audio quality dependent on source material: Limited by original audiobook recording quality
- Extension to other low-resource languages: Applying the pipeline to additional languages
- Improved quality assessment framework: Developing more accurate automatic evaluation metrics
- Enhanced speaker diversity: Balancing gender and age distribution
- Multimodal extension: Incorporating visual information in speech synthesis
- Significant Scale Improvement: Achieving 10-fold increase compared to existing Persian resources, filling an important gap
- Technical Innovation:
  - Novel and effective BERT-based sentence completeness detection
  - Efficient and practical binary search boundary optimization algorithm
  - Well-targeted Persian-specific quality assessment framework
- Comprehensive Experimentation:
  - Combined subjective and objective evaluation
  - Comparison with multiple baseline methods
  - Detailed dataset analysis and statistics
- Open-Source Contribution: Complete dataset publicly released to advance community development
- Method Reproducibility: Detailed description of each pipeline step
- Limited Evaluation Scope:
  - Validation on only one TTS model (XTTS)
  - Lack of direct comparison with other large-scale multilingual datasets
- Quality Assessment Subjectivity:
  - Quality assessment framework weights based on empirical settings
  - Lack of comparative validation with human annotation quality
- Insufficient Technical Details:
  - Speaker identification threshold selection lacks detailed explanation
  - Limited implementation details of quality assessment framework
- Academic Impact:
  - Provides important resources for low-resource language TTS research
  - Advances Persian speech technology development
  - Offers reusable dataset construction methodology
- Practical Value:
  - Directly supports Persian TTS application development
  - Reduces digital divide between Persian and high-resource languages
  - Provides foundational data for commercial speech applications
- Reproducibility: Open-source release and detailed methodology description ensure research reproducibility
- Direct Applications:
  - Persian TTS system training
  - Persian adaptation of multilingual TTS models
  - Speech synthesis quality assessment research
- Extended Applications:
  - Dataset construction for other low-resource languages
  - Speech processing pipeline development
  - Cross-lingual speech technology research
This paper cites 18 important references covering:
- Transformer architecture foundations (Vaswani et al., 2017)
- English speech datasets (LibriSpeech, LJSpeech, VCTK)
- Multilingual speech resources (Common Voice, VoxPopuli)
- Persian NLP tools (ParsBERT)
- Modern TTS technology (XTTS)
- Speaker identification technology (ECAPA-TDNN)
Overall Assessment: This is a high-quality resource paper that addresses the important resource scarcity problem by constructing a large-scale Persian TTS corpus. The methodology demonstrates moderate innovation but strong practicality, with sufficient experimental validation and significant implications for Persian speech technology development. The open-source release further enhances its academic and practical value.