2025-11-15T13:07:11.069047

ParsVoice: A Large-Scale Multi-Speaker Persian Speech Corpus for Text-to-Speech Synthesis

Kalahroodi, Faili, Shakery

Existing Persian speech datasets are typically smaller than their English counterparts, which creates a key limitation for developing Persian speech technologies. We address this gap by introducing ParsVoice, the largest Persian speech corpus designed specifically for text-to-speech (TTS) applications. We created an automated pipeline that transforms raw audiobook content into TTS-ready data, incorporating components such as a BERT-based sentence completion detector, a binary search boundary optimization method for precise audio-text alignment, and audio-text quality assessment frameworks tailored to Persian. The pipeline processed 2,000 audiobooks, yielding 3,526 hours of clean speech, which was further filtered into a 1,804-hour high-quality subset suitable for TTS, featuring more than 470 speakers. To validate the dataset, we fine-tuned XTTS for Persian, achieving a naturalness Mean Opinion Score (MOS) of 3.6/5 and a Speaker Similarity Mean Opinion Score (SMOS) of 4.0/5, demonstrating ParsVoice's effectiveness for training multi-speaker TTS systems. ParsVoice is the largest high-quality Persian speech dataset, offering speaker diversity and audio quality comparable to major English corpora. To accelerate the development of Persian speech technologies, the complete dataset is publicly available at: https://huggingface.co/datasets/MohammadJRanjbar/ParsVoice.

academic

Basic Information

Paper ID: 2510.10774
Title: ParsVoice: A Large-Scale Multi-Speaker Persian Speech Corpus for Text-to-Speech Synthesis
Authors: Mohammad Javad Ranjbar Kalahroodi, Heshaam Faili, Azadeh Shakery (University of Tehran)
Categories: cs.SD (Sound), cs.AI (Artificial Intelligence), cs.HC (Human-Computer Interaction), cs.LG (Machine Learning)
Published: October 14, 2025 (arXiv v2)
Paper link: https://arxiv.org/abs/2510.10774

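Abstract

Existing Persian speech datasets are typically much smaller than their English counterparts, which is a key limitation for developing Persian speech technologies. The paper addresses this gap by introducing ParsVoice, the largest Persian speech corpus designed specifically for text-to-speech (TTS) applications. The research team built an automated pipeline that converts raw audiobook content into TTS-ready data. It includes a BERT-based sentence completion detector, a binary search boundary optimization method for precise audio-text alignment, and audio-text quality assessment frameworks tailored to Persian. The pipeline processed 2,000 audiobooks and produced 3,526 hours of clean speech, which was further filtered into a 1,804-hour high-quality subset featuring more than 470 speakers. To validate the dataset, the team fine-tuned XTTS for Persian, achieving a naturalness Mean Opinion Score (MOS) of 3.6/5 and a Speaker Similarity Mean Opinion Score (SMOS) of 4.0/5.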
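The abstract names a binary search boundary optimization step but does not describe how it works. The following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a hypothetical predicate `is_complete(t)` that reports whether the audio cut at end time `t` (in seconds) already contains the whole target sentence, and it assumes that predicate is monotone (false before the true boundary, true from the boundary onward). The function name `find_boundary` and the tolerance `tol` are likewise illustrative.

```python
from typing import Callable


def find_boundary(
    is_complete: Callable[[float], bool],
    lo: float,
    hi: float,
    tol: float = 0.01,
) -> float:
    """Return an end time within `tol` seconds of the earliest complete cut.

    Assumes `is_complete` is monotone on [lo, hi] and true at `hi`.
    """
    if not is_complete(hi):
        raise ValueError("sentence is not complete anywhere in the search window")
    # Invariant: is_complete(hi) is True, and the true boundary lies in (lo, hi].
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if is_complete(mid):
            hi = mid  # the cut at mid is long enough; try a shorter one
        else:
            lo = mid  # the cut at mid truncates the sentence; extend it
    return hi


if __name__ == "__main__":
    # Toy stand-in for a real check (hypothetical): the sentence ends at 3.37 s.
    boundary = find_boundary(lambda t: t >= 3.37, lo=0.0, hi=10.0)
    print(f"estimated boundary: {boundary:.3f} s")
```

In a real pipeline the predicate would be the expensive part (for example, a speech recognizer or aligner call), which is why halving the search window at each step is attractive: a 10-second window reaches 10 ms precision in about ten evaluations.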

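Research Background and Motivation

Problem Definition

Data scarcity: Persian is spoken by more than 100 million people worldwide, yet it is severely underrepresented in speech corpora, leaving a large gap relative to high-resource languages such as English.
TTS-specific requirements: Text-to-speech systems place different demands on data quality than automatic speech recognition (ASR) systems. ASR can benefit from noisy real-world data, whereas TTS needs clean, precisely aligned audio-text pairs to generate natural-sounding speech.
Limitations of existing datasets:
- DeepMine+: 480+ hours, 1,850+ speakers, but subject to commercial restrictions
- DeepMine-Multi-TTS: 120 hours, 67 speakers
- ArmanTTS: 9 hours, single speaker
- ManaTTS: 86 hours, single speaker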
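To put the figures in this list next to the ParsVoice numbers from the abstract, the short script below computes how many times larger the 1,804-hour TTS subset is than each existing dataset, and what fraction of the 3,526 clean hours survived filtering. It is a back-of-the-envelope sketch: the variable names are illustrative, and the "480+" lower bound for DeepMine+ is taken at face value.

```python
# Hours of speech per existing dataset, as listed above.
EXISTING_HOURS = {
    "DeepMine+": 480,  # reported as "480+", so this is a lower bound
    "DeepMine-Multi-TTS": 120,
    "ArmanTTS": 9,
    "ManaTTS": 86,
}

PARSVOICE_CLEAN_HOURS = 3526  # clean speech produced by the pipeline
PARSVOICE_TTS_HOURS = 1804    # high-quality subset kept for TTS


def hours_ratio(name: str) -> float:
    """ParsVoice TTS-subset hours divided by the named dataset's hours."""
    return PARSVOICE_TTS_HOURS / EXISTING_HOURS[name]


# Fraction of the clean speech that passed the TTS quality filter.
retention = PARSVOICE_TTS_HOURS / PARSVOICE_CLEAN_HOURS

if __name__ == "__main__":
    for name in EXISTING_HOURS:
        print(f"{name}: ParsVoice TTS subset has {hours_ratio(name):.1f}x the hours")
    print(f"Retained after filtering: {retention:.1%}")
```

By hours alone, the TTS subset is roughly 3.8 times the size of DeepMine+, the largest of the listed datasets, and about half of the clean speech passed the quality filter. Hours are not the whole picture: DeepMine+ lists more speakers (1,850+ versus 470+).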