2025-11-11T15:01:09.602202

HPLT 3.0: Very Large-Scale Multilingual Resources for LLM and MT. Mono- and Bi-lingual Data, Multilingual Evaluation, and Pre-Trained Models

Oepen, Arefev, Aulamo et al.

We present an ongoing initiative to provide open, very large, high-quality, and richly annotated textual datasets for almost 200 languages. At 30 trillion tokens, this is likely the largest generally available multilingual collection of LLM pre-training data. These datasets are derived from web crawls from different sources and accompanied with a complete, open-source pipeline for document selection from web archives, text extraction from HTML, language identification for noisy texts, exact and near-deduplication, annotation with, among others, register labels, text quality estimates, and personally identifiable information; and final selection and filtering. We report on data quality probes through contrastive and analytical statistics, through manual inspection of samples for 24 languages, and through end-to-end evaluation of various language model architectures trained on this data. For multilingual LLM evaluation, we provide a comprehensive collection of benchmarks for nine European languages, with special emphasis on natively created tasks, mechanisms to mitigate prompt sensitivity, and refined normalization and aggregation of scores. Additionally, we train and evaluate a family of 57 monolingual encoder-decoder models, as well as a handful of monolingual GPT-like reference models. Besides the monolingual data and models, we also present a very large collection of parallel texts automatically mined from this data, together with a novel parallel corpus synthesized via machine translation.

Basic Information

Paper ID: 2511.01066
Title: HPLT 3.0: Very Large-Scale Multilingual Resources for LLM and MT. Mono- and Bi-lingual Data, Multilingual Evaluation, and Pre-Trained Models
Authors: Stephan Oepen and colleagues from multiple European academic institutions
Category: cs.CL (Computation and Language)
Published: November 2025
Paper link: https://arxiv.org/abs/2511.01066

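Summary

The paper introduces HPLT 3.0, an initiative to provide open, very large, high-quality, and richly annotated text datasets for almost 200 languages. At 30 trillion tokens, it is likely the largest publicly available multilingual collection of LLM pre-training data. The datasets are derived from web crawls from different sources and come with a complete open-source processing pipeline covering document selection, text extraction, language identification, deduplication, and quality estimation.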

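Background and Motivation

Problem Definition

Data scarcity: Large-scale, high-quality multilingual pre-training data is mostly controlled by large companies, and academia lacks accessible resources
Language inequality: Existing datasets skew toward English, leaving other languages, especially low-resource ones, severely short of data
Quality control: Web-crawled data varies widely in quality and needs systematic cleaning and filtering
Evaluation standards: There is no unified framework for evaluating multilingual models

Why It Matters

Democratizing AI: Opening up large-scale datasets lowers the barrier to LLM research and development
Multilingual fairness: More training data for low-resource languages supports linguistic diversity
Academic research: The research community gains a reproducible experimental foundation

Limitations of Existing Approaches

Datasets such as C4 and FineWeb focus mainly on English
Multilingual datasets such as MADLAD-400 are comparatively small
Unified standards for data processing and evaluation are lacking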

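Core Contributions

Built a very large multilingual dataset of 30 trillion tokens covering almost 200 languages
Developed a complete open-source data processing pipeline, including text extraction, language identification, deduplication, and quality estimation
Introduced the HPLT-E multilingual evaluation framework, covering 127 tasks across 9 European languages
Trained 57 monolingual encoder-decoder models and several GPT-style reference models
Built a large parallel text collection that combines automatically mined data with data synthesized via machine translation
Provided a comprehensive data quality analysis, including statistical analysis and manual inspection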

Method Details

Data Collection and Processing Pipeline

Raw Data Sources

Internet Archive (IA): 3.3 PB of crawl data from 2012-2020
Common Crawl (CC): 57 complete snapshots (2014-2025), about 7.2 PB in total

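Core Processing Steps

Text extraction
- Uses the Trafilatura framework to extract text from HTML
- Tunes hyperparameters to favor extraction quality over speed
Language identification
- Uses the OpenLID-v2 model to predict the language of each document
- Supports the language labels of the Flores+ evaluation set
- Improves preprocessing: whitespace normalization, lowercasing, and removal of non-word characters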
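The three preprocessing steps named for language identification can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `preprocess_for_lid` is hypothetical, the order of the steps is a guess, and treating digits and underscores as non-word characters is a choice made here, not something the summary specifies. It is not the pipeline's actual code.

```python
import re

# Assumption: "non-word characters" means punctuation and symbols; digits and
# underscores are also dropped here because they carry no language signal.
_NON_WORD = re.compile(r"[^\w\s]|[\d_]")
_WHITESPACE = re.compile(r"\s+")


def preprocess_for_lid(text: str) -> str:
    """Normalize text before passing it to a language identifier (sketch)."""
    text = text.lower()                    # lowercasing
    text = _NON_WORD.sub(" ", text)        # remove non-word characters
    text = _WHITESPACE.sub(" ", text)      # collapse runs of whitespace
    return text.strip()


if __name__ == "__main__":
    print(preprocess_for_lid("  Hello,\tWORLD!! 123 "))
    print(preprocess_for_lid("Привет, мир!"))
```

One caveat with this sketch: Python's `\w` does not match combining marks, so scripts that rely on them (for example many Indic scripts) would lose characters, and a production pipeline would need a more careful character class.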
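Deduplication
- Applies MinHash-based global near-deduplication to all languages except English, Russian, and Chinese
- The large languages are instead deduplicated per crawl for computational efficiency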
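To make the idea of MinHash-based near-deduplication concrete, here is a minimal, standard-library sketch. All names (`shingles`, `minhash_signature`, `estimated_jaccard`, `drop_near_duplicates`) and all parameter values (word 3-gram shingles, 128 hash functions, the similarity threshold) are assumptions chosen for illustration; the summary does not state the settings the project uses. The pairwise comparison is quadratic, whereas large-scale pipelines typically add locality-sensitive hashing to avoid comparing every pair.

```python
import hashlib


def shingles(text: str, n: int = 3) -> set[str]:
    """Return the set of word n-grams of a document."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def minhash_signature(items: set[str], num_perm: int = 128) -> list[int]:
    """For each salted hash function, keep the minimum hash over all items."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """The fraction of matching positions estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def drop_near_duplicates(docs: list[str], threshold: float = 0.8) -> list[str]:
    """Greedily keep a document only if it is not similar to one already kept."""
    kept: list[tuple[str, list[int]]] = []
    for doc in docs:
        sig = minhash_signature(shingles(doc))
        if all(estimated_jaccard(sig, other) < threshold for _, other in kept):
            kept.append((doc, sig))
    return [doc for doc, _ in kept]
```

Deduplicating per crawl, as described for the large languages, would amount to calling something like `drop_near_duplicates` separately on each crawl's documents instead of once on the whole collection.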
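Quality estimation and annotation
- Web Docs Scorer (WDS): combines heuristic document-filtering methods
- Register labels: the Turku web register classifier adds register (text genre) labels for 104 languages
- WDS levels: documents are grouped into six quality levels, {5, 6, 7, 8, 9, 10}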