We present an ongoing initiative to provide open, very large, high-quality, and richly annotated textual datasets for almost 200 languages. At 30 trillion tokens, this is likely the largest generally available multilingual collection of LLM pre-training data. These datasets are derived from web crawls from different sources and accompanied with a complete, open-source pipeline for document selection from web archives, text extraction from HTML, language identification for noisy texts, exact and near-deduplication, annotation with, among others, register labels, text quality estimates, and personally identifiable information; and final selection and filtering. We report on data quality probes through contrastive and analytical statistics, through manual inspection of samples for 24 languages, and through end-to-end evaluation of various language model architectures trained on this data. For multilingual LLM evaluation, we provide a comprehensive collection of benchmarks for nine European languages, with special emphasis on natively created tasks, mechanisms to mitigate prompt sensitivity, and refined normalization and aggregation of scores. Additionally, we train and evaluate a family of 57 monolingual encoder-decoder models, as well as a handful of monolingual GPT-like reference models. Besides the monolingual data and models, we also present a very large collection of parallel texts automatically mined from this data, together with a novel parallel corpus synthesized via machine translation.
HPLT 3.0: Very Large-Scale Multilingual Resources for LLM and MT. Mono- and Bi-lingual Data, Multilingual Evaluation, and Pre-Trained Models
- Paper ID: 2511.01066
- Title: HPLT 3.0: Very Large-Scale Multilingual Resources for LLM and MT. Mono- and Bi-lingual Data, Multilingual Evaluation, and Pre-Trained Models
- Authors: Stephan Oepen and researchers from multiple European academic institutions
- Classification: cs.CL (Computation and Language)
- Publication Date: November 2025
- Paper Link: https://arxiv.org/abs/2511.01066
This paper introduces HPLT 3.0, an initiative to provide open, very large, high-quality, and richly annotated text datasets for nearly 200 languages. At 30 trillion tokens, it is likely the largest publicly available multilingual collection of LLM pretraining data to date. The data is derived from web crawls from several sources and comes with a complete open-source processing pipeline covering document selection, text extraction, language identification, deduplication, and quality assessment.
- Data Scarcity: Large-scale, high-quality multilingual pretraining data is typically controlled by large enterprises, with limited resources available to the academic community
- Language Inequality: Existing datasets predominantly favor English, with severe data insufficiency for other languages, particularly low-resource languages
- Quality Control: Web-crawled data exhibits highly variable quality, necessitating systematic cleaning and filtering mechanisms
- Evaluation Standards: Lack of unified evaluation frameworks for multilingual models
- Democratizing AI: Lowering the barriers to LLM development through open large-scale datasets
- Multilingual Fairness: Providing increased training data for low-resource languages, promoting linguistic diversity
- Academic Research: Furnishing the research community with reproducible experimental foundations
- Datasets such as C4 and FineWeb primarily focus on English
- Multilingual datasets like MADLAD-400 have relatively limited scale
- Lack of unified data processing and evaluation standards
- Constructed a 30 trillion token multilingual dataset covering nearly 200 languages
- Developed a complete open-source data processing pipeline including text extraction, language identification, deduplication, and quality assessment
- Proposed the HPLT-E multilingual evaluation framework encompassing 127 tasks across 9 European languages
- Trained 57 monolingual encoder-decoder models and multiple GPT-style reference models
- Constructed large-scale parallel text datasets including automatically mined and machine translation-synthesized data
- Provided comprehensive data quality analysis including statistical analysis and manual inspection
- Internet Archive (IA): 3.3 PB of crawler data from 2012-2020
- Common Crawl (CC): 57 complete snapshots (2014-2025), approximately 7.2 PB total
- Text Extraction
- Employing the Trafilatura framework for HTML text extraction
- Optimizing hyperparameter settings, prioritizing extraction quality over speed
- Language Identification
- Adopting the OpenLID-v2 model for language prediction
- Supporting language labels in the Flores+ evaluation set
- Improving preprocessing pipeline: space normalization, lowercasing, removal of non-word characters
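The three preprocessing steps named above (space normalization, lowercasing, removal of non-word characters) can be sketched as a small text-cleaning function. This is only an illustration: the order of the steps, the regular expressions, and the function name `normalize_for_lid` are assumptions, not the actual OpenLID-v2 preprocessing code.

```python
import re

# Assumption: "non-word characters" means anything that is neither a Unicode
# word character nor whitespace; digits are kept. The real pipeline may differ.
_NON_WORD = re.compile(r"[^\w\s]")
_SPACES = re.compile(r"\s+")


def normalize_for_lid(text: str) -> str:
    """Clean a noisy text snippet before passing it to a language identifier."""
    text = text.lower()                     # lowercasing
    text = _NON_WORD.sub(" ", text)         # removal of non-word characters
    return _SPACES.sub(" ", text).strip()   # space normalization


print(normalize_for_lid("  Hello,   WORLD!! 123 \n Café au-lait "))
# prints: hello world 123 café au lait
```

Replacing punctuation with a space rather than deleting it keeps tokens such as "au-lait" from being fused into a single string.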
- Deduplication Processing
- Implementing global approximate deduplication based on MinHash for all languages except English, Russian, and Chinese
- Employing per-crawler deduplication for large languages to enhance computational efficiency
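To make the idea of MinHash-based approximate deduplication concrete, the following is a minimal, standard-library sketch. It is not the HPLT implementation: the word-shingle size, the number of hash functions, the similarity threshold, and all function names are assumptions chosen for illustration, and the pairwise comparison would be replaced by locality-sensitive hashing at web scale.

```python
import hashlib


def shingles(text: str, k: int = 3) -> set:
    """Return the set of word k-grams of a document (assumed shingle type)."""
    tokens = text.lower().split()
    if len(tokens) < k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def minhash_signature(text: str, num_hashes: int = 64, k: int = 3) -> list:
    """For each salted hash function, keep the minimum hash over all shingles."""
    shingle_set = shingles(text, k)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in shingle_set
        ))
    return signature


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(docs: list, threshold: float = 0.8) -> list:
    """Keep a document only if it is not too similar to any kept document.

    Quadratic in the number of documents; fine for a demo only.
    """
    kept = []  # pairs of (document, signature)
    for doc in docs:
        sig = minhash_signature(doc)
        if all(estimated_jaccard(sig, other) < threshold for _, other in kept):
            kept.append((doc, sig))
    return [doc for doc, _ in kept]


a = "the quick brown fox jumps over the lazy dog near the river bank"
b = "completely different content about multilingual language model pretraining data"
print(len(deduplicate([a, a, b])))  # the exact duplicate of `a` is dropped
```

The same signatures can be computed once per document and reused for either global or per-crawl deduplication; only the set of documents being compared changes.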
- Quality Assessment and Annotation
- Web Docs Scorer (WDS): Integrating heuristic document filtering methods
- Register Labels: Using the Turku web register classifier to add stylistic tags for 104 languages
- WDS Levels: Categorizing documents into six quality levels {5, 6, 7, 8, 9, 10}
- Binning and globally sorting documents for each language by WDS level
- Using Zstandard-compressed JSONlines format
- Totaling approximately 50TB of data distributed across 3000 files
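The binning and sorting by WDS level described above can be illustrated with a short sketch over JSON Lines records. The field names `lang` and `wds`, the 0 to 10 score range, and the rule that maps a score to a level (integer part, with scores below 5 discarded) are all assumptions made for this example; the released files may use a different schema and a different mapping.

```python
import json
from collections import defaultdict


def wds_level(score: float):
    """Map a score to a level in {5, ..., 10}; return None below 5 (assumed rule)."""
    level = min(10, int(score))
    return level if level >= 5 else None


def bin_by_language_and_level(jsonl_lines) -> dict:
    """Group documents by (language, WDS level), best score first within a bin."""
    bins = defaultdict(list)
    for line in jsonl_lines:
        doc = json.loads(line)
        level = wds_level(doc["wds"])      # "wds" is a hypothetical field name
        if level is None:
            continue
        bins[(doc["lang"], level)].append(doc)
    for docs in bins.values():
        docs.sort(key=lambda d: d["wds"], reverse=True)
    return dict(bins)


def iter_best_first(bins: dict, lang: str):
    """Yield one language's documents from the highest level down to level 5."""
    for level in range(10, 4, -1):
        yield from bins.get((lang, level), [])


lines = [
    json.dumps({"lang": "fin_Latn", "wds": 9.2, "text": "a"}),
    json.dumps({"lang": "fin_Latn", "wds": 5.5, "text": "b"}),
    json.dumps({"lang": "fin_Latn", "wds": 3.0, "text": "c"}),
    json.dumps({"lang": "fin_Latn", "wds": 9.8, "text": "d"}),
]
bins = bin_by_language_and_level(lines)
print([doc["wds"] for doc in iter_best_first(bins, "fin_Latn")])
# prints: [9.8, 9.2, 5.5]
```

Storing documents in this order lets a user take the top of each language's data by reading files sequentially and stopping at the desired level.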
Nine European languages: English, Spanish, French, German, Italian, Czech, Finnish, Norwegian, and Ukrainian.
- Architecture: Decoder model based on Llama architecture
- Scale: 2.15B parameters, 24 layers, 32 attention heads
- Training Data: 100B tokens per language
- Sequence Length: 2048
- Training Platform: LUMI supercomputer, 16 nodes with AMD MI250x GPUs
Comprising 127 language understanding and generation tasks, covering:
- Textual entailment
- Commonsense reasoning
- Language-specific and world knowledge
- Paraphrasing
- Reading comprehension
- Sentiment analysis
- Toxicity detection
- Factuality assessment
- Architecture: T5-base (approximately 275M parameters)
- Language Coverage: 57 languages
- Language Families: Spanning 14 language families
- Named Entity Recognition: WikiAnn benchmark
- Language Proficiency: MultiBLiMP benchmark
| Dataset | English Docs | English Tokens | Multilingual Docs | Multilingual Tokens | Total Tokens |
|---|---|---|---|---|---|
| HPLT 3.0 | 18B | 16T | 11B | 13T | 29T |
| FineWeb | 24B | 17T | 5.0B | 4.9T | 22T |
| HPLT 2.0 | 4.4B | 3.9T | 6.1B | 7.2T | 11T |
| MADLAD-400 | 1.5B | 1.7T | 2.1B | 2.7T | 4.4T |
According to the HPLT-E framework evaluation, model performance ranking:
- MADLAD-400: Highest multilingual score
- HPLT 3.0: Second place, significantly outperforming previous versions
- HPLT 2.0 and FineWeb: Comparable performance
- Low-quality data (bottom WDS levels): Noticeably reduces model performance
- High-quality data (top WDS levels): Comparable to random sampling performance, possibly due to insufficient diversity
- Random sampling: Best performance on Spanish and French
| Language | HPLT T5 | mT5-base | BERT HPLT |
|---|---|---|---|
| Catalan | 92.7 | 87.4 | 94.5 |
| Czech | 91.6 | 85.2 | 91.8 |
| English | 82.1 | 77.6 | 82.7 |
| Basque | 92.0 | 82.8 | 92.9 |
| Finnish | 90.3 | 1.8 | 91.6 |
| Language | HPLT T5 | mT5-base | mT5-xxl |
|---|---|---|---|
| Catalan | 95.6 | 91.6 | 93.0 |
| Czech | 95.9 | 88.8 | 93.4 |
| English | 94.2 | 90.6 | 95.3 |
| Basque | 97.4 | 94.9 | 96.0 |
Average Performance: The HPLT T5 models achieve 93.5% on MultiBLiMP, well above mT5-base's 86.8%
- Pornographic Content: Below 2% for most languages
- Language Identification Errors: Generally low overall, though the Bosnian dataset primarily contains Serbian and the Asturian dataset frequently contains Spanish
- Non-Natural Text: Varies considerably across languages, partially reflecting subjective annotation standards
- Text Defects: Including navigation elements, truncated text, etc., proportions varying by language
- Unique Paragraph Ratio: HPLT 3.0 at 73% vs HPLT 2.0 at 52%, reflecting global deduplication effectiveness
- Domain Diversity: Reduced over-representation of Wikipedia pages compared to HPLT 2.0
- Geographic TLD Distribution: Highly correlated with regional language usage
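A statistic such as the unique paragraph ratio above can be computed along the following lines. This is a sketch under assumptions: paragraphs are taken to be newline-separated, whitespace is stripped, and uniqueness means exact string equality across the whole collection; the paper's exact definition may differ.

```python
def unique_paragraph_ratio(documents) -> float:
    """Share of distinct paragraphs among all non-empty paragraphs."""
    total = 0
    seen = set()
    for doc in documents:
        for paragraph in doc.split("\n"):
            paragraph = paragraph.strip()
            if not paragraph:
                continue
            total += 1
            seen.add(paragraph)
    return len(seen) / total if total else 0.0


docs = [
    "Home\nAbout us\nUnique text one",
    "Home\nAbout us\nUnique text two",
]
print(round(unique_paragraph_ratio(docs), 3))
# prints: 0.667
```

In the toy input, the repeated navigation lines ("Home", "About us") lower the ratio, which is the kind of boilerplate repetition that global deduplication is meant to reduce.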
- C4: Google and Allen AI's primarily English-focused dataset
- FineWeb: Hugging Face's high-quality web data
- MADLAD-400: Google's 400-language dataset
- Nemotron-CC: Nvidia's refined Common Crawl data
- Existing Benchmarks: Predominantly biased toward English or limited high-resource languages
- Evaluation Challenges: Prompt sensitivity, cross-lingual consistency, cultural bias, etc.
- Text Extraction: Development of tools such as Trafilatura
- Language Identification: Evolution from traditional methods to deep learning models
- Deduplication Techniques: From exact matching to approximate matching methods
- Scale Breakthrough: At 30 trillion tokens, HPLT 3.0 is likely the largest public multilingual pretraining dataset
- Quality Enhancement: Improved processing pipeline significantly enhances data quality, reflected in model performance
- Evaluation Innovation: HPLT-E framework establishes new standards for multilingual model evaluation
- Model Contribution: 57 monolingual encoder-decoder models provide practical tools for the community
- Quality Assessment: Despite manual inspection, quality evaluation of large-scale data remains challenging
- Language Coverage: While supporting nearly 200 languages, resource distribution remains imbalanced
- Evaluation Scope: HPLT-E framework currently covers only 9 European languages
- Computational Resources: Large-scale training requires substantial computational resources, limiting reproducibility
- Data Expansion: Planning extended release including ArchiveBot data in early 2026
- Evaluation Extension: Expanding HPLT-E framework to more languages and tasks
- Quality Improvement: Continuing optimization of data processing pipeline and quality control mechanisms
- Application Research: Exploring synthetic data applications in low-resource languages
- Unprecedented Scale: 30 trillion tokens, likely the largest scale among public multilingual datasets
- Open Transparency: Complete open-source pipeline and detailed technical documentation
- Systematic Approach: Complete ecosystem from data collection to model training
- Quality Control: Multi-layered quality assessment and manual verification mechanisms
- Practical Value: Provides directly usable pre-trained models
- Computational Threshold: While data is open, training large models still requires substantial computational resources
- Quality Variance: Significant differences in data quality and quantity across languages
- Evaluation Limitations: Relatively small manual evaluation samples, potentially introducing bias
- Cultural Bias: Inherent geographic and cultural biases in web data difficult to completely eliminate
- Academic Contribution: Provides important infrastructure for multilingual NLP research
- Industry Impact: Lowers development barriers for multilingual AI applications
- Social Value: Promotes linguistic diversity and democratization of AI technology
- Standard Setting: The HPLT-E evaluation framework may become a common reference point for multilingual evaluation
- Multilingual LLM Pretraining: Direct application to large language model pretraining
- Language-Specific Models: Developing specialized models for low-resource languages
- Cross-Lingual Research: Supporting linguistics and computational linguistics research
- Machine Translation: Providing parallel corpora and monolingual data
- Educational Applications: Furnishing resources for language learning and instruction
- Global Deduplication: Cross-crawler global approximate deduplication enhancing data diversity
- Quality Grading: WDS scoring system providing fine-grained quality control
- Multi-Dimensional Annotation: Combining register labels, quality assessment, PII detection, and other annotations
- Multi-Prompt Design: Each task supporting 3-7 human-written prompts, reducing prompt sensitivity
- Task Selection Criteria: Selecting evaluation tasks based on seven standards including monotonicity and stability
- Aggregation Methods: Combining multiple aggregation approaches including average scores, rankings, and Borda count
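As an illustration of how per-prompt scores might be combined and then aggregated across tasks with a Borda count, consider the sketch below. The choice of averaging over prompts, the tie handling, the input layout, and the model and task names are assumptions for this example rather than the HPLT-E implementation.

```python
def mean(values) -> float:
    return sum(values) / len(values)


def borda_count(task_scores: dict) -> dict:
    """Aggregate model rankings across tasks with a Borda count.

    task_scores maps task -> model -> list of per-prompt scores.
    Per-prompt scores are averaged first (assumed reduction). On each task,
    the best of n models earns n - 1 points, the next n - 2, and so on.
    Ties are broken by input order because Python's sort is stable.
    """
    points = {}
    for per_model in task_scores.values():
        averaged = {model: mean(scores) for model, scores in per_model.items()}
        ranked = sorted(averaged, key=averaged.get, reverse=True)
        n = len(ranked)
        for position, model in enumerate(ranked):
            points[model] = points.get(model, 0) + (n - 1 - position)
    return points


# Hypothetical models A, B, C with two prompts per task.
scores = {
    "nli":       {"A": [0.6, 0.8], "B": [0.5, 0.7], "C": [0.5, 0.5]},
    "qa":        {"A": [0.4, 0.4], "B": [0.8, 0.8], "C": [0.5, 0.5]},
    "sentiment": {"A": [0.9, 0.9], "B": [0.3, 0.3], "C": [0.6, 0.6]},
}
print(borda_count(scores))
# prints: {'A': 4, 'B': 3, 'C': 2}
```

Because a Borda count uses only ranks, a task with a wide score range cannot dominate the aggregate the way it can in a plain average of scores.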
- Language-Specific Models: Training specialized encoder-decoder models for 57 languages
- Intermediate Checkpoints: Providing intermediate checkpoints during training, supporting learning process research
- Synthetic Data: Generating additional pretraining data through machine translation
This paper cites extensive related work, primarily including:
- Raffel et al. (2020): T5 model and C4 dataset
- Penedo et al. (2024, 2025): FineWeb dataset series
- Kudugunta et al. (2023): MADLAD-400 dataset
- Burchell et al. (2025): HPLT 2.0 dataset
- Multiple papers on multilingual evaluation benchmarks
Summary: The HPLT 3.0 project represents an important milestone in multilingual NLP, achieving breakthroughs not only in data scale but also establishing new benchmarks in openness, quality control, and evaluation standards. While certain limitations remain, it holds significant importance for promoting democratization and development of multilingual AI technology.