HPLT 3.0: Very Large-Scale Multilingual Resources for LLM and MT. Mono- and Bi-lingual Data, Multilingual Evaluation, and Pre-Trained Models

Oepen, Arefev, Aulamo et al.
We present an ongoing initiative to provide open, very large, high-quality, and richly annotated textual datasets for almost 200 languages. At 30 trillion tokens, this is likely the largest generally available multilingual collection of LLM pre-training data. These datasets are derived from web crawls from different sources and accompanied with a complete, open-source pipeline for document selection from web archives, text extraction from HTML, language identification for noisy texts, exact and near-deduplication, annotation with, among others, register labels, text quality estimates, and personally identifiable information; and final selection and filtering. We report on data quality probes through contrastive and analytical statistics, through manual inspection of samples for 24 languages, and through end-to-end evaluation of various language model architectures trained on this data. For multilingual LLM evaluation, we provide a comprehensive collection of benchmarks for nine European languages, with special emphasis on natively created tasks, mechanisms to mitigate prompt sensitivity, and refined normalization and aggregation of scores. Additionally, we train and evaluate a family of 57 monolingual encoder-decoder models, as well as a handful of monolingual GPT-like reference models. Besides the monolingual data and models, we also present a very large collection of parallel texts automatically mined from this data, together with a novel parallel corpus synthesized via machine translation.
Basic Information

  • Paper ID: 2511.01066
  • Title: HPLT 3.0: Very Large-Scale Multilingual Resources for LLM and MT. Mono- and Bi-lingual Data, Multilingual Evaluation, and Pre-Trained Models
  • Authors: Stephan Oepen and researchers from multiple European academic institutions
  • Classification: cs.CL (Computational Linguistics)
  • Publication Date: November 2025
  • Paper Link: https://arxiv.org/abs/2511.01066

Abstract

This paper introduces the HPLT 3.0 project, an initiative to provide open, very large, high-quality, and richly annotated text datasets for nearly 200 languages. The collection contains 30 trillion tokens, which likely makes it the largest generally available multilingual collection of LLM pretraining data to date. The data is derived from web crawls from different sources and comes with a complete open-source processing pipeline covering document selection, text extraction, language identification, deduplication, annotation, and quality assessment.

Research Background and Motivation

Problem Definition

  1. Data Scarcity: Large-scale, high-quality multilingual pretraining data is typically controlled by large enterprises, with limited resources available to the academic community
  2. Language Inequality: Existing datasets predominantly favor English, with severe data insufficiency for other languages, particularly low-resource languages
  3. Quality Control: Web-crawled data exhibits highly variable quality, necessitating systematic cleaning and filtering mechanisms
  4. Evaluation Standards: Lack of unified evaluation frameworks for multilingual models

Research Significance

  • Democratizing AI: Lowering the barriers to LLM development through open large-scale datasets
  • Multilingual Fairness: Providing increased training data for low-resource languages, promoting linguistic diversity
  • Academic Research: Furnishing the research community with reproducible experimental foundations

Limitations of Existing Approaches

  • Datasets such as C4 and FineWeb primarily focus on English
  • Multilingual datasets like MADLAD-400 have relatively limited scale
  • Lack of unified data processing and evaluation standards

Core Contributions

  1. Constructed a 30 trillion token ultra-large-scale multilingual dataset covering nearly 200 languages
  2. Developed a complete open-source data processing pipeline including text extraction, language identification, deduplication, and quality assessment
  3. Proposed the HPLT-E multilingual evaluation framework encompassing 127 tasks across 9 European languages
  4. Trained 57 monolingual encoder-decoder models and multiple GPT-style reference models
  5. Constructed large-scale parallel text datasets including automatically mined and machine translation-synthesized data
  6. Provided comprehensive data quality analysis including statistical analysis and manual inspection

Methodology Details

Data Collection and Processing Pipeline

Raw Data Sources

  • Internet Archive (IA): 3.3 PB of crawl data from 2012-2020
  • Common Crawl (CC): 57 complete snapshots (2014-2025), approximately 7.2 PB total

Core Processing Steps

  1. Text Extraction
    • Employing the Trafilatura framework for HTML text extraction
    • Optimizing hyperparameter settings, prioritizing extraction quality over speed
  2. Language Identification
    • Adopting the OpenLID-v2 model for language prediction
    • Covering the language labels of the Flores+ evaluation set
    • Improving the preprocessing pipeline: whitespace normalization, lowercasing, removal of non-word characters
  3. Deduplication
    • Implementing global near-deduplication based on MinHash for all languages except English, Russian, and Chinese
    • Employing per-crawl deduplication for these three largest languages to keep computation tractable
  4. Quality Assessment and Annotation
    • Web Docs Scorer (WDS): Integrating heuristic document filtering methods
    • Register Labels: Using the Turku web register classifier to add stylistic tags for 104 languages
    • WDS Levels: Categorizing documents into six quality levels {5, 6, 7, 8, 9, 10}
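
The language-identification preprocessing named in step 2 (whitespace normalization, lowercasing, removal of non-word characters) can be sketched as follows. This is a minimal illustration and not the project's actual code: the function name `normalize_for_lid` is hypothetical, and treating "non-word characters" as everything outside Python's Unicode `\w` class is an assumption. A production pipeline would likely need script-aware handling, for example for combining marks.

```python
import re

# Assumption: "non-word characters" means anything outside \w and whitespace.
_NON_WORD = re.compile(r"[^\w\s]+")
_WHITESPACE = re.compile(r"\s+")


def normalize_for_lid(text: str) -> str:
    """Clean noisy web text before handing it to a language identifier."""
    text = text.lower()                    # lowercasing
    text = _NON_WORD.sub(" ", text)        # drop punctuation and symbols
    text = _WHITESPACE.sub(" ", text)      # collapse runs of whitespace
    return text.strip()


if __name__ == "__main__":
    print(normalize_for_lid("  Hello,   WORLD!!\n\tČau :) "))
```

Replacing non-word characters with a space rather than deleting them keeps adjacent words from being glued together when punctuation is the only separator.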

Data Packaging and Distribution

  • Binning and globally sorting documents for each language by WDS level
  • Using Zstandard-compressed JSONlines format
  • Totaling approximately 50TB of data distributed across 3000 files
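
The binning step can be illustrated with a short sketch. It assumes that each document carries a numeric score in a hypothetical `wds_score` field and that the WDS level is the integer part of that score. Both are assumptions made for illustration, not the released schema. Zstandard compression is omitted so the example needs only the standard library.

```python
import json
from collections import defaultdict

WDS_LEVELS = (5, 6, 7, 8, 9, 10)  # the six quality levels listed above


def bin_by_wds(docs):
    """Group documents by WDS level; within a level, best score first."""
    bins = defaultdict(list)
    for doc in docs:
        # Assumption: level = integer part of the score, capped at 10.
        level = min(int(doc["wds_score"]), 10)
        if level in WDS_LEVELS:  # scores below 5 fall outside the six levels
            bins[level].append(doc)
    for level_docs in bins.values():
        level_docs.sort(key=lambda d: d["wds_score"], reverse=True)
    return dict(bins)


def to_jsonl(docs):
    """Serialize documents as JSON Lines, one object per line."""
    return "".join(json.dumps(d, ensure_ascii=False) + "\n" for d in docs)


if __name__ == "__main__":
    sample = [
        {"text": "a", "wds_score": 9.4},
        {"text": "b", "wds_score": 5.0},
        {"text": "c", "wds_score": 3.2},
        {"text": "d", "wds_score": 9.9},
    ]
    for level, level_docs in sorted(bin_by_wds(sample).items(), reverse=True):
        print(level, to_jsonl(level_docs), end="")
```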

Experimental Setup

HPLT-E Evaluation Framework

Language Selection

Nine European languages: English, Spanish, French, German, Italian, Czech, Finnish, Norwegian, Ukrainian, etc.

Model Training Configuration

  • Architecture: Decoder model based on Llama architecture
  • Scale: 2.15B parameters, 24 layers, 32 attention heads
  • Training Data: 100B tokens per language
  • Sequence Length: 2048
  • Training Platform: LUMI supercomputer, 16 nodes with AMD MI250x GPUs
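
As a rough sanity check on this configuration, the figures above imply about 48.8 million training sequences and about 46.5 tokens per parameter for each language. The sketch below reproduces that arithmetic. It assumes every sequence is packed to the full 2048 tokens, which the summary does not state.

```python
PARAMS = 2.15e9   # model parameters, from the configuration above
TOKENS = 100e9    # training tokens per language
SEQ_LEN = 2048    # tokens per training sequence


def training_budget(params: float, tokens: float, seq_len: int) -> dict:
    """Derive simple budget figures from the stated configuration."""
    return {
        # Assumption: sequences are fully packed, with no padding.
        "sequences": tokens / seq_len,
        "tokens_per_param": tokens / params,
    }


if __name__ == "__main__":
    budget = training_budget(PARAMS, TOKENS, SEQ_LEN)
    print(f"{budget['sequences']:,.0f} sequences")
    print(f"{budget['tokens_per_param']:.1f} tokens per parameter")
```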

Evaluation Tasks

Comprising 127 language understanding and generation tasks, covering:

  • Textual entailment
  • Commonsense reasoning
  • Language-specific and world knowledge
  • Paraphrasing
  • Reading comprehension
  • Sentiment analysis
  • Toxicity detection
  • Factuality assessment

Encoder-Decoder Models

Model Configuration

  • Architecture: T5-base (approximately 275M parameters)
  • Language Coverage: 57 languages
  • Language Families: Spanning 14 language families

Evaluation Tasks

  1. Named Entity Recognition: WikiAnn benchmark
  2. Language Proficiency: MultiBLiMP benchmark

Experimental Results

Dataset Comparative Analysis

Dataset | English Docs | English Tokens | Multilingual Docs | Multilingual Tokens | Total Tokens
HPLT 3.0 | 18B | 16T | 11B | 13T | 29T
FineWeb | 24B | 17T | 5.0B | 4.9T | 22T
HPLT 2.0 | 4.4B | 3.9T | 6.1B | 7.2T | 11T
MADLAD-400 | 1.5B | 1.7T | 2.1B | 2.7T | 4.4T
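
The token counts above also show how each collection balances English against other languages. The sketch below derives the non-English share of tokens from the table. The dictionary layout and function name are illustrative, and the shares are only as precise as the rounded table values.

```python
# Token counts in trillions, copied from the comparison table above.
TOKENS_T = {
    "HPLT 3.0": {"english": 16, "multilingual": 13},
    "FineWeb": {"english": 17, "multilingual": 4.9},
    "HPLT 2.0": {"english": 3.9, "multilingual": 7.2},
    "MADLAD-400": {"english": 1.7, "multilingual": 2.7},
}


def multilingual_share(counts: dict) -> float:
    """Fraction of all tokens that are not English."""
    total = counts["english"] + counts["multilingual"]
    return counts["multilingual"] / total


if __name__ == "__main__":
    for name, counts in TOKENS_T.items():
        print(f"{name}: {multilingual_share(counts):.0%} non-English tokens")
```

Under these rounded figures, roughly 45% of HPLT 3.0 tokens are non-English, compared with about 22% for FineWeb, 65% for HPLT 2.0, and 61% for MADLAD-400.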

Multilingual LLM Evaluation Results

Dataset Performance Comparison

On the HPLT-E evaluation, models trained on each dataset rank as follows:

  1. MADLAD-400: Highest multilingual score
  2. HPLT 3.0: Second place, outperforming the previous HPLT release
  3. HPLT 2.0 and FineWeb: Comparable performance

WDS Quality Level Experiments

  • Low-quality data (bottom WDS levels): Noticeably reduces model performance
  • High-quality data (top WDS levels): Comparable to random sampling performance, possibly due to insufficient diversity
  • Random sampling: Best performance on Spanish and French

Encoder-Decoder Model Results

Named Entity Recognition (WikiAnn F1 Score)

Language | HPLT T5 | mT5-base | BERT HPLT
Catalan | 92.7 | 87.4 | 94.5
Czech | 91.6 | 85.2 | 91.8
English | 82.1 | 77.6 | 82.7
Basque | 92.0 | 82.8 | 92.9
Finnish | 90.3 | 1.8 | 91.6

Language Proficiency (MultiBLiMP Accuracy)

Language | HPLT T5 | mT5-base | mT5-xxl
Catalan | 95.6 | 91.6 | 93.0
Czech | 95.9 | 88.8 | 93.4
English | 94.2 | 90.6 | 95.3
Basque | 97.4 | 94.9 | 96.0

Average Performance: The HPLT T5 models average 93.5% on MultiBLiMP, compared with 86.8% for mT5-base

Data Quality Analysis

Manual Inspection Results (24 languages)

  • Pornographic Content: Below 2% for most languages
  • Language Identification Errors: Generally low, though the Bosnian dataset primarily contains Serbian and the Asturian dataset frequently contains Spanish
  • Non-Natural Text: Varies considerably across languages, partially reflecting subjective annotation standards
  • Text Defects: Including navigation elements, truncated text, etc., proportions varying by language

Statistical Feature Improvements

  • Unique Paragraph Ratio: HPLT 3.0 at 73% vs HPLT 2.0 at 52%, reflecting global deduplication effectiveness
  • Domain Diversity: Reduced over-representation of Wikipedia pages compared to HPLT 2.0
  • Geographic TLD Distribution: Highly correlated with regional language usage
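
A statistic like the unique paragraph ratio can be computed along the following lines. The definition used here, distinct paragraphs divided by all paragraphs with newline-separated paragraphs, is an assumption made for illustration. The summary does not spell out the exact definition behind the 73% and 52% figures.

```python
def unique_paragraph_ratio(documents):
    """Share of distinct paragraphs among all paragraphs in a corpus sample.

    Assumption: paragraphs are newline-separated and compared after
    stripping surrounding whitespace.
    """
    total = 0
    seen = set()
    for doc in documents:
        for para in doc.split("\n"):
            para = para.strip()
            if not para:
                continue  # ignore empty lines
            total += 1
            seen.add(para)
    return len(seen) / total if total else 0.0


if __name__ == "__main__":
    sample = [
        "Home\nAbout us\nFirst story.",
        "Home\nAbout us\nSecond story.",
    ]
    print(f"{unique_paragraph_ratio(sample):.2f}")
```

Repeated navigation text such as "Home" or "About us" lowers the ratio, which is why the statistic is a useful proxy for how much boilerplate survives deduplication.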

Related Work

Large-Scale Pretraining Datasets

  • C4: Google and Allen AI's primarily English-focused dataset
  • FineWeb: Hugging Face's high-quality web data
  • MADLAD-400: Google's 400-language dataset
  • Nemotron-CC: Nvidia's refined Common Crawl data

Multilingual Model Evaluation

  • Existing Benchmarks: Predominantly biased toward English or limited high-resource languages
  • Evaluation Challenges: Prompt sensitivity, cross-lingual consistency, cultural bias, etc.

Data Processing Techniques

  • Text Extraction: Development of tools such as Trafilatura
  • Language Identification: Evolution from traditional methods to deep learning models
  • Deduplication Techniques: From exact matching to approximate matching methods
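
To make the step from exact to approximate matching concrete, here is a minimal MinHash sketch using only the standard library. It is an illustration of the general technique, not the HPLT implementation: the character 5-gram shingles, the 64 hash functions, and all function names are assumptions. Large-scale pipelines typically add locality-sensitive hashing so that signatures need not be compared pairwise.

```python
import hashlib


def shingles(text: str, n: int = 5) -> set:
    """Character n-grams of a lowercased, whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}


def minhash_signature(text: str, num_perm: int = 64) -> list:
    """Keep the minimum hash value per seeded hash function."""
    grams = shingles(text)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a new hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for g in grams
        ))
    return signature


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """The fraction of matching positions estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


if __name__ == "__main__":
    a = "The quick brown fox jumps over the lazy dog near the river bank."
    b = "The quick brown fox jumps over the lazy dog near the riverbank!"
    c = "Completely different content about multilingual corpora and tokenizers."
    sig_a, sig_b, sig_c = (minhash_signature(t) for t in (a, b, c))
    print("near-duplicate:", estimated_jaccard(sig_a, sig_b))
    print("unrelated:", estimated_jaccard(sig_a, sig_c))
```

Exact matching would treat the first two sentences as different documents. The signature comparison gives them a high similarity estimate, so a threshold can flag them as near-duplicates.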

Conclusions and Discussion

Main Conclusions

  1. Scale Breakthrough: At 30 trillion tokens, HPLT 3.0 is likely the largest generally available multilingual pretraining dataset
  2. Quality Enhancement: Improved processing pipeline significantly enhances data quality, reflected in model performance
  3. Evaluation Innovation: HPLT-E framework establishes new standards for multilingual model evaluation
  4. Model Contribution: 57 monolingual encoder-decoder models provide practical tools for the community

Limitations

  1. Quality Assessment: Despite manual inspection, quality evaluation of large-scale data remains challenging
  2. Language Coverage: While supporting nearly 200 languages, resource distribution remains imbalanced
  3. Evaluation Scope: HPLT-E framework currently covers only 9 European languages
  4. Computational Resources: Large-scale training requires substantial computational resources, limiting reproducibility

Future Directions

  1. Data Expansion: Planning extended release including ArchiveBot data in early 2026
  2. Evaluation Extension: Expanding HPLT-E framework to more languages and tasks
  3. Quality Improvement: Continuing optimization of data processing pipeline and quality control mechanisms
  4. Application Research: Exploring synthetic data applications in low-resource languages

In-Depth Evaluation

Strengths

  1. Unprecedented Scale: 30 trillion tokens, likely the largest scale among publicly available multilingual datasets
  2. Open Transparency: Complete open-source pipeline and detailed technical documentation
  3. Systematic Approach: Complete ecosystem from data collection to model training
  4. Quality Control: Multi-layered quality assessment and manual verification mechanisms
  5. Practical Value: Provides directly usable pre-trained models

Weaknesses

  1. Computational Threshold: While data is open, training large models still requires substantial computational resources
  2. Quality Variance: Significant differences in data quality and quantity across languages
  3. Evaluation Limitations: Relatively small manual evaluation samples, potentially introducing bias
  4. Cultural Bias: Geographic and cultural biases inherent in web data are difficult to eliminate completely

Impact

  1. Academic Contribution: Provides important infrastructure for multilingual NLP research
  2. Industry Impact: Lowers development barriers for multilingual AI applications
  3. Social Value: Promotes linguistic diversity and democratization of AI technology
  4. Standard Setting: The HPLT-E evaluation framework may become a reference point for multilingual evaluation

Applicable Scenarios

  1. Multilingual LLM Pretraining: Direct application to large language model pretraining
  2. Language-Specific Models: Developing specialized models for low-resource languages
  3. Cross-Lingual Research: Supporting linguistics and computational linguistics research
  4. Machine Translation: Providing parallel corpora and monolingual data
  5. Educational Applications: Furnishing resources for language learning and instruction

Technical Innovation Points

Data Processing Innovation

  1. Global Deduplication: Global near-deduplication across crawls, enhancing data diversity
  2. Quality Grading: WDS scoring system providing fine-grained quality control
  3. Multi-Dimensional Annotation: Combining register labels, quality assessment, PII detection, and other annotations

Evaluation Method Innovation

  1. Multi-Prompt Design: Each task supporting 3-7 human-written prompts, reducing prompt sensitivity
  2. Task Selection Criteria: Selecting evaluation tasks based on seven criteria, including monotonicity and stability
  3. Aggregation Methods: Combining multiple aggregation approaches including average scores, rankings, and Borda count
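
The difference between averaging scores and rank-based aggregation can be shown with a small Borda count sketch. The scores below are made up for three hypothetical systems and do not come from the paper. Resolving ties by input order is a simplification chosen for brevity.

```python
def mean_scores(task_scores):
    """Average each system's score over all tasks."""
    systems = next(iter(task_scores.values())).keys()
    n_tasks = len(task_scores)
    return {s: sum(scores[s] for scores in task_scores.values()) / n_tasks
            for s in systems}


def borda_count(task_scores):
    """Per task, the worst system earns 0 points and the best earns n - 1."""
    totals = {}
    for scores in task_scores.values():
        ranked = sorted(scores, key=scores.get)  # worst to best
        for points, system in enumerate(ranked):
            totals[system] = totals.get(system, 0) + points
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Made-up scores for three hypothetical systems; higher is better.
EXAMPLE = {
    "task_1": {"A": 0.95, "B": 0.60, "C": 0.55},
    "task_2": {"A": 0.40, "B": 0.45, "C": 0.42},
    "task_3": {"A": 0.40, "B": 0.45, "C": 0.42},
}

if __name__ == "__main__":
    print("mean:", mean_scores(EXAMPLE))
    print("borda:", borda_count(EXAMPLE))
```

In this toy example system A has the highest mean because of a single outlier task, while Borda count prefers system B, which wins two of the three tasks. Reporting several aggregates side by side makes such disagreements visible.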

Model Training Innovation

  1. Language-Specific Models: Training specialized encoder-decoder models for 57 languages
  2. Intermediate Checkpoints: Providing intermediate checkpoints during training, supporting learning process research
  3. Synthetic Data: Generating additional pretraining data through machine translation

References

This paper cites extensive related work, primarily including:

  • Raffel et al. (2020): T5 model and C4 dataset
  • Penedo et al. (2024, 2025): FineWeb dataset series
  • Kudugunta et al. (2023): MADLAD-400 dataset
  • Burchell et al. (2025): HPLT 2.0 dataset
  • Multiple papers on multilingual evaluation benchmarks

Summary: The HPLT 3.0 project is an important milestone in multilingual NLP. It advances data scale and also sets new reference points for openness, quality control, and evaluation. While limitations remain, it is a significant step toward democratizing the development of multilingual AI technology.