Getting Your Indices in a Row: Full-Text Search for LLM Training Data for Real World

Marinas, Kucherenko, Sternfeld et al.

The performance of Large Language Models (LLMs) is determined by their training data. Despite the proliferation of open-weight LLMs, access to LLM training data has remained limited. Even for fully open LLMs, the scale of the data makes it all but inscrutable to the general scientific community, despite potentially containing critical data scraped from the internet. In this paper, we present the full-text indexing pipeline for the Apertus LLM training data. Leveraging Elasticsearch parallel indices and the Alps infrastructure, a state-of-the-art, highly energy-efficient arm64 supercluster, we were able to index 8.6T tokens out of 15.2T used to train the Apertus LLM family, creating both a critical LLM safety tool and effectively an offline, curated, open web search engine. Our contribution is threefold. First, we demonstrate that Elasticsearch can be successfully ported onto next-generation arm64-based infrastructure. Second, we demonstrate that full-text indexing at the scale of modern LLM training datasets and the entire open web is feasible and accessible. Finally, we demonstrate that such indices can be used to ensure previously inaccessible jailbreak-agnostic LLM safety. We hope that our findings will be useful to other teams attempting large-scale data indexing and facilitate the general transition towards greener computation.

Basic Information

Paper ID: 2510.09471
Title: Getting Your Indices in a Row: Full-Text Search for LLM Training Data for Real World
Authors: Inés Altemir Mariñas (EPFL), Anastasiia Kucherenko (HES-SO Valais-Wallis), Alexander Sternfeld (HES-SO Valais-Wallis), Andrei Kucharavy (HES-SO Valais-Wallis)
Classification: cs.CL (Computation and Language)
Conference: WWW '26 (The Web Conference 2026)
Paper Link: https://arxiv.org/abs/2510.09471

Abstract

The performance of Large Language Models (LLMs) depends critically on their training data. Despite the increasing availability of open-weight LLMs, access to LLM training data remains restricted. Even for fully open LLMs, the scale of data makes in-depth analysis challenging for the broader scientific community, despite potentially containing critical data scraped from the internet. This paper presents a full-text indexing pipeline for Apertus LLM training data. Utilizing Elasticsearch parallel indexing and the Alps infrastructure (a state-of-the-art high-efficiency ARM64 supercluster), we successfully indexed 8.6T tokens out of 15.2T tokens used for training the Apertus LLM family, creating a critical LLM safety tool and an offline, curated open web search engine.

Research Background and Motivation

Core Problems

Lack of Training Data Transparency: Despite the proliferation of open-weight LLMs, training data remains difficult to access and analyze
Data Scale Challenges: Modern LLM training data is massive (trillions of tokens), making systematic inspection nearly impossible
Safety Risks: Training data may contain harmful content, including personal information, copyrighted material, toxic language, and even dangerous information

Research Significance

LLM Safety: Problematic content in training data directly affects model behavior, leading to harmful outputs
Transparency Requirements: The scientific community and regulatory bodies need the ability to audit LLM training data
Compliance Needs: Identification and removal of copyrighted content, personal information, etc.

Limitations of Existing Approaches

Sampling-based Analysis: Existing tools rely primarily on small samples (e.g., 1% of Common Crawl), failing to ensure comprehensive coverage
Scale Limitations: Previous largest full-text index (Infinigram) supported only 4.6T tokens and only exact matching
Limited Functionality: Lack of fuzzy search and logical operation capabilities

Core Contributions

ARM64 Architecture Migration: First demonstration of successful Elasticsearch deployment on ARM64-based GH200 HPC systems
Large-Scale Indexing Implementation: Indexed an 8.6T-token dataset, about 4 times larger than previous Elasticsearch-based indices and about 2 times larger than the largest previous full-text index overall (4.6T tokens)
LLM Safety Applications: Demonstrated full-text indexing for LLM safety and security use cases, providing jailbreak-agnostic safeguards that act on the training data itself
Open-Source Contribution: Provided complete open-source code and performance benchmarks to support future research

Methodology Details

Task Definition

Build a system capable of full-text search on trillion-token-scale LLM training data, supporting:

Exact and fuzzy matching
Multi-language content search
Logical operations and complex queries
Real-time search responses

System Architecture

1. Data Processing Pipeline

Raw Parquet Files → Stream Processing → Text Analysis → Elasticsearch Index
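
The first stages of this pipeline can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: the column name `text`, the index name `apertus-demo`, and both helper names are hypothetical, and pyarrow is assumed as the Parquet reader.

```python
from typing import Any, Dict, Iterable, Iterator


def rows_to_actions(rows: Iterable[Dict[str, Any]], index: str,
                    text_field: str = "text") -> Iterator[Dict[str, Any]]:
    """Turn raw rows into Elasticsearch bulk actions, skipping empty texts."""
    for row in rows:
        text = row.get(text_field)
        if not text:
            continue  # nothing to index for this row
        yield {"_index": index, "_source": {text_field: text}}


def stream_parquet_rows(path: str, batch_size: int = 1000,
                        text_field: str = "text") -> Iterator[Dict[str, Any]]:
    """Yield rows batch by batch so the whole file is never held in memory."""
    import pyarrow.parquet as pq  # third-party; imported lazily (assumed reader)

    parquet_file = pq.ParquetFile(path)
    for batch in parquet_file.iter_batches(batch_size=batch_size,
                                           columns=[text_field]):
        yield from batch.to_pylist()


# Offline demonstration with in-memory rows instead of a real Parquet file.
sample_rows = [{"text": "<p>Hello web</p>"}, {"text": ""}, {"text": "second doc"}]
actions = list(rows_to_actions(sample_rows, index="apertus-demo"))
print(len(actions))
```

Because both helpers are generators, `rows_to_actions(stream_parquet_rows(path), index=...)` would feed documents to the indexer lazily, which is the property the "Stream Processing" stage relies on.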

2. Core Components

Elasticsearch Engine: Distributed search and analytics engine
Parallel Indexing: Multi-threaded concurrent processing using elasticsearch.helpers.parallel_bulk
Text Analyzer: web_content_analyzer performs HTML cleaning, standard tokenization, lowercasing, and ASCII folding
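
An analyzer with the four steps listed for `web_content_analyzer` could be declared along the following lines. The analyzer name comes from the summary; the mapping onto Elasticsearch's built-in `html_strip`, `standard`, `lowercase`, and `asciifolding` components is an assumption about how those steps were configured, and the field name `text` is hypothetical.

```python
import json

# Assumed index body: built-in Elasticsearch components chained in the order
# the summary describes (HTML cleaning, tokenizing, lowercasing, ASCII folding).
INDEX_BODY = {
    "settings": {
        "analysis": {
            "analyzer": {
                "web_content_analyzer": {
                    "type": "custom",
                    "char_filter": ["html_strip"],
                    "tokenizer": "standard",
                    "filter": ["lowercase", "asciifolding"],
                }
            }
        }
    },
    "mappings": {
        "properties": {
            # Hypothetical field name; the real schema is not given in the text.
            "text": {"type": "text", "analyzer": "web_content_analyzer"}
        }
    },
}

print(json.dumps(INDEX_BODY, indent=2))
```

With the official Python client, the `settings` and `mappings` parts of such a body would typically be passed when the index is created.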

3. Key Parameter Tuning

Thread Count: Not exceeding CPU core count, balancing concurrency and memory pressure
Chunk Size: Bounded by chunk_size ≤ max_chunk_bytes / avg_doc_size, so that a full chunk of average-sized documents stays within the byte limit
Maximum Chunk Bytes: Controls maximum payload of bulk requests
Queue Size: Buffers imbalances between producer and consumer threads
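
The four parameters correspond to keyword arguments of `elasticsearch.helpers.parallel_bulk`. The sketch below shows one way they might be wired together; the helper names, the default values, and the example numbers are assumptions for illustration, not the settings used in the paper.

```python
from typing import Any, Dict, Iterable


def safe_chunk_size(max_chunk_bytes: int, avg_doc_bytes: int) -> int:
    """Largest document count per bulk request that stays under the byte cap."""
    if avg_doc_bytes <= 0:
        raise ValueError("avg_doc_bytes must be positive")
    return max(1, max_chunk_bytes // avg_doc_bytes)


def index_all(client: Any, actions: Iterable[Dict[str, Any]],
              thread_count: int, avg_doc_bytes: int,
              max_chunk_bytes: int = 100 * 1024 * 1024,
              queue_size: int = 8) -> int:
    """Run parallel_bulk with the four tuning knobs; return the failure count."""
    from elasticsearch.helpers import parallel_bulk  # third-party client

    failures = 0
    for ok, _info in parallel_bulk(
        client,
        actions,
        thread_count=thread_count,      # keep at or below the CPU core count
        chunk_size=safe_chunk_size(max_chunk_bytes, avg_doc_bytes),
        max_chunk_bytes=max_chunk_bytes,
        queue_size=queue_size,          # buffer between producer and workers
        raise_on_error=False,           # count failures instead of aborting
    ):
        if not ok:
            failures += 1
    return failures


# Hypothetical numbers: a 100 MiB cap and documents averaging 4 KiB.
print(safe_chunk_size(100 * 1024 * 1024, 4 * 1024))  # 25600
```

`parallel_bulk` is a lazy generator, so its results have to be consumed (as the loop does) for any indexing to happen.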

Technical Innovations

1. ARM64 Adaptation

Construction of OCI-compatible custom container images
Resolution of Docker compatibility issues, using Podman as alternative
Re-implementation of orchestration through SLURM job definitions

2. HPC Environment Optimization

Disabling memory mapping to accommodate kernel parameter limitations
Network configuration bypassing proxies, binding to 127.0.0.1
Single-node operation mode adapted to SLURM job isolation
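
These adaptations plausibly map onto standard Elasticsearch node settings. The sketch below renders such a configuration as `elasticsearch.yml` text. The three setting names are real Elasticsearch settings, but the assumption that they are exactly what was used is not confirmed by the summary, and proxy bypassing (usually an environment matter) is not covered.

```python
from typing import Dict, Union

Setting = Union[str, bool]

# Assumed mapping of the three adaptations onto standard Elasticsearch settings.
HPC_SETTINGS: Dict[str, Setting] = {
    "discovery.type": "single-node",  # one node per SLURM job, no cluster discovery
    "network.host": "127.0.0.1",      # bind to loopback only
    "node.store.allow_mmap": False,   # avoid mmap where kernel limits cannot be raised
}


def render_yaml(settings: Dict[str, Setting]) -> str:
    """Render flat key/value settings as elasticsearch.yml lines."""
    lines = []
    for key, value in settings.items():
        rendered = str(value).lower() if isinstance(value, bool) else value
        lines.append(f"{key}: {rendered}")
    return "\n".join(lines) + "\n"


print(render_yaml(HPC_SETTINGS))
```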

3. Query Optimization

match_phrase query: Supports a configurable word-distance tolerance (the slop parameter)
Multi-level text processing: HTML cleaning → standard tokenization → normalization → ASCII folding
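
A phrase query with a word-distance tolerance can be written in the Elasticsearch query DSL as shown below. The helper names and the field name `text` are hypothetical; `match_phrase`, `slop`, and `bool`/`must` are standard query DSL elements.

```python
from typing import Any, Dict


def phrase_query(field: str, phrase: str, slop: int = 0) -> Dict[str, Any]:
    """Build a match_phrase query; slop=0 means the words must be adjacent."""
    if slop < 0:
        raise ValueError("slop must be non-negative")
    return {"match_phrase": {field: {"query": phrase, "slop": slop}}}


def all_of(*clauses: Dict[str, Any]) -> Dict[str, Any]:
    """Combine clauses with logical AND using a bool query."""
    return {"bool": {"must": list(clauses)}}


exact = phrase_query("text", "nitric acid")
loose = phrase_query("text", "nitric acid", slop=2)
combined = all_of(loose, phrase_query("text", "glycerin"))
print(combined)
```

With `slop=2`, a document containing "nitric and sulfuric acid" would match, since the two query terms are within two positions of their expected places; with `slop=0` it would not.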

Experimental Setup

Dataset

Apertus Training Data Subset (8.6T tokens, 58% of total training data):

Dataset	Tokens (B)
FineWeb-Edu (Score-2)	4,815
FineWeb-2-HQ (33% highest quality)	3,557
StarCoder	235
FineMath CommonCrawl Subset	32
Gutenberg and Poison	2

Query Datasets

Weaponized Words Dictionary: Harmful vocabulary across 137 languages
LDNOOBW List: Profanity across 28 languages
Chemical Weapons Dataset: 17 dangerous chemical reagent terms

Computing Environment

Alps Supercomputer: HPE Cray EX system, 434 PFlops performance
Node Configuration: ARM64-based NVIDIA Grace Hopper GH200
Storage System: 100PB ClusterStor HDD + 3PB SSD + 1PB VAST

Experimental Results

Indexing Performance

Dataset	Data Size (GB)	Time (h)	Indexing Rate (doc/s)	Indexing Overhead Ratio	Peak Memory (GB)
FineWeb-2 Edu (EN)	12,737	143.7	10,296	1.3	4.9
FineWeb-2 Europe HQ	2,660	408.3	589	1.1	7.5
StarCoder	229	4.2	10,919	1.4	12.7

Key Findings:

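English text is indexed far faster than the multilingual dataset (10,296 vs 589 doc/s)
Code data requires more memory (12.7 GB peak vs 4.9 GB for English web text)
Multilingual data is the slowest to index: FineWeb-2 Europe HQ took 408.3 h for 2,660 GB, although its indexing overhead ratio in the table (1.1) is the lowest of the three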
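
Since documents differ in size across datasets, a size-normalised throughput can be derived from the table by dividing data size by wall-clock time. The figures below are computed from the table values and are not reported in the source.

```python
# (dataset, data size in GB, indexing time in hours), copied from the table.
ROWS = [
    ("FineWeb-2 Edu (EN)", 12_737, 143.7),
    ("FineWeb-2 Europe HQ", 2_660, 408.3),
    ("StarCoder", 229, 4.2),
]


def gb_per_hour(size_gb: float, hours: float) -> float:
    """Size-normalised indexing throughput."""
    return size_gb / hours


throughput = {name: round(gb_per_hour(size, hours), 1) for name, size, hours in ROWS}
for name, value in throughput.items():
    print(f"{name}: {value} GB/h")
```

On these numbers the English dataset is indexed at roughly 88.6 GB/h, the code dataset at about 54.5 GB/h, and the multilingual dataset at about 6.5 GB/h.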

Query Performance

Query time increases roughly linearly with query length
Single-word queries: <100 ms
300-word queries: ~1,000 ms
Latency grows predictably, without erratic behavior, across the tested query lengths

Harmful Content Analysis

Multilingual Harmful Vocabulary Statistics

Language	Weaponized Words (Million)	LDNOOBW (Million)
English	1,245.8	661.6
French	16.8	202.5
German	9.9	14.9
Italian	1.6	18.5

Common chemical substances (e.g., glycerin, nitric acid) appear with extremely high frequency, while specialized chemical weapons synthesis terminology also shows significant presence in non-English languages, highlighting the importance of multilingual data curation.
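
A per-term tally of this kind could be gathered with the count API, one phrase query per dictionary entry. The sketch below is an assumption about the general approach rather than the authors' procedure: the helper name, index name, and field name are hypothetical, it counts matching documents rather than individual occurrences, and a small stand-in client replaces a real cluster so the example runs offline.

```python
from typing import Any, Dict, Iterable, List


def count_phrase_hits(client: Any, index: str, terms: Iterable[str],
                      field: str = "text") -> Dict[str, int]:
    """Return the number of documents matching each term as an exact phrase."""
    hits: Dict[str, int] = {}
    for term in terms:
        response = client.count(
            index=index,
            query={"match_phrase": {field: {"query": term}}},
        )
        hits[term] = int(response["count"])
    return hits


class FakeClient:
    """Stand-in for an Elasticsearch client (substring match, not real analysis)."""

    def __init__(self, documents: List[str]) -> None:
        self.documents = [doc.lower() for doc in documents]

    def count(self, index: str, query: Dict[str, Any]) -> Dict[str, int]:
        phrase = query["match_phrase"]["text"]["query"].lower()
        return {"count": sum(phrase in doc for doc in self.documents)}


client = FakeClient([
    "Nitric acid is a reagent.",
    "Glycerin and nitric acid.",
    "Nothing here.",
])
result = count_phrase_hits(client, "apertus-demo", ["nitric acid", "glycerin"])
print(result)  # {'nitric acid': 2, 'glycerin': 1}
```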

Related Work

Existing LLM Data Analysis Tools

Data Portraits: Uses approximate membership inference to reduce computational costs
Statistical Sampling Methods: Such as Luccioni et al.'s analysis of 1% of Common Crawl
Small-Scale Dataset Tools: HuggingFace's Data Measurements, Google's Know Your Data

Large-Scale Indexing Systems

WhatIsInMyBigData: Maximum index of 1.4T tokens (RedPajama)
Infinigram: Uses suffix arrays, supports 4.6T tokens exact search
ROOTS Tools: Fuzzy and exact search for 1.6TB multilingual corpus

Advantages of This Work

Scale: 8.6T tokens, about 4 times the size of existing Elasticsearch-based indices
Functionality: Supports fuzzy search and logical operations
Multilingual: Covers safety analysis across multiple languages

Conclusions and Discussion

Main Conclusions

Technical Feasibility: Demonstrated the feasibility of deploying Elasticsearch on ARM64 architecture
Scale Achievability: Full-text indexing of trillion-token-scale data is achievable for small teams
Safety Applications: Full-text indexing can be applied to deep safety analysis of LLM training data

Limitations

Coverage Scope: Only indexed 58% of Apertus training data
Architecture Constraints: ARM64 adaptation still faces compatibility challenges
Memory Mapping: Inability to use memory mapping reduces I/O efficiency

Future Directions

Complete Internet Indexing: Building offline search indices for the entire open web
LLM Fact Grounding: Verification of LLM-generated content based on offline search
Economic and Ethical Issues: Fair compensation mechanisms for content creators

In-Depth Evaluation

Strengths

High Practical Value: Addresses the critical issue of LLM training data transparency
Significant Technical Contributions: Largest Elasticsearch-based full-text index of LLM training data reported to date (8.6T tokens)
Open-Source Friendly: Provides complete code and detailed deployment guidelines
Clear Safety Applications: Demonstrates concrete LLM safety use cases
Environmentally Conscious: Uses energy-efficient ARM64 architecture, with reported emissions of only 90 kg CO2eq

Weaknesses

Incomplete Data Coverage: Did not index all training data
ARM64 Challenges: Complex technical adaptation process may limit adoption
Performance Trade-offs: Sacrificed some I/O performance to accommodate HPC environment
Limited Depth of Safety Analysis: Relatively surface-level analysis of harmful content

Impact

Academic Contribution: Provides new technical pathways for LLM training data analysis
Practical Value: Directly applicable to LLM safety audits
Technology Promotion: Promotes ARM64 adoption in enterprise applications
Policy Support: Provides technical tools for LLM regulation

Applicable Scenarios

LLM Development Teams: Training data quality control and safety audits
Research Institutions: Large-scale text data analysis and mining
Regulatory Bodies: LLM compliance checking and risk assessment
Enterprise Applications: Content filtering and data governance

References

The paper cites 60 relevant references covering multiple domains including LLM training, data security, and full-text search, providing a solid theoretical foundation for the research.

Overall Assessment: This is a technically valuable paper that successfully addresses critical issues in LLM training data transparency and safety analysis. While facing certain limitations in data coverage and technical adaptation, its pioneering work provides important technical foundations and practical guidance for the field.