Getting Your Indices in a Row: Full-Text Search for LLM Training Data for Real World

Marinas, Kucherenko, Sternfeld et al.
The performance of Large Language Models (LLMs) is determined by their training data. Despite the proliferation of open-weight LLMs, access to LLM training data has remained limited. Even for fully open LLMs, the scale of the data makes it all but inscrutable to the general scientific community, despite potentially containing critical data scraped from the internet. In this paper, we present the full-text indexing pipeline for the Apertus LLM training data. Leveraging Elasticsearch parallel indices and the Alps infrastructure, a state-of-the-art, highly energy-efficient arm64 supercluster, we were able to index 8.6T tokens out of 15.2T used to train the Apertus LLM family, creating both a critical LLM safety tool and effectively an offline, curated, open web search engine. Our contribution is threefold. First, we demonstrate that Elasticsearch can be successfully ported onto next-generation arm64-based infrastructure. Second, we demonstrate that full-text indexing at the scale of modern LLM training datasets and the entire open web is feasible and accessible. Finally, we demonstrate that such indices can be used to ensure previously inaccessible jailbreak-agnostic LLM safety. We hope that our findings will be useful to other teams attempting large-scale data indexing and facilitate the general transition towards greener computation.

Basic Information

  • Paper ID: 2510.09471
  • Title: Getting Your Indices in a Row: Full-Text Search for LLM Training Data for Real World
  • Authors: Inés Altemir Mariñas (EPFL), Anastasiia Kucherenko (HES-SO Valais-Wallis), Alexander Sternfeld (HES-SO Valais-Wallis), Andrei Kucharavy (HES-SO Valais-Wallis)
  • Classification: cs.CL (Computation and Language)
  • Conference: WWW '26 (The Web Conference 2026)
  • Paper Link: https://arxiv.org/abs/2510.09471

Abstract

The performance of Large Language Models (LLMs) depends critically on their training data. Despite the increasing availability of open-weight LLMs, access to LLM training data remains restricted. Even for fully open LLMs, the scale of data makes in-depth analysis challenging for the broader scientific community, despite potentially containing critical data scraped from the internet. This paper presents a full-text indexing pipeline for Apertus LLM training data. Utilizing Elasticsearch parallel indexing and the Alps infrastructure (a state-of-the-art high-efficiency ARM64 supercluster), we successfully indexed 8.6T tokens out of 15.2T tokens used for training the Apertus LLM family, creating a critical LLM safety tool and an offline, curated open web search engine.

Research Background and Motivation

Core Problems

  1. Lack of Training Data Transparency: Despite the proliferation of open-weight LLMs, training data remains difficult to access and analyze
  2. Data Scale Challenges: Modern LLM training data is massive (trillions of tokens), making systematic inspection nearly impossible
  3. Safety Risks: Training data may contain harmful content, including personal information, copyrighted material, toxic language, and even dangerous information

Research Significance

  • LLM Safety: Problematic content in training data directly affects model behavior, leading to harmful outputs
  • Transparency Requirements: The scientific community and regulatory bodies need the ability to audit LLM training data
  • Compliance Needs: Identification and removal of copyrighted content, personal information, etc.

Limitations of Existing Approaches

  • Sampling-based Analysis: Existing tools rely primarily on small samples (e.g., 1% of Common Crawl), failing to ensure comprehensive coverage
  • Scale Limitations: Previous largest full-text index (Infinigram) supported only 4.6T tokens and only exact matching
  • Limited Functionality: Lack of fuzzy search and logical operation capabilities

Core Contributions

  1. ARM64 Architecture Migration: First demonstration of successful Elasticsearch deployment on ARM64-based GH200 HPC systems
  2. Large-Scale Indexing Implementation: Indexed an 8.6T-token dataset, about 4 times larger than previous Elasticsearch-based indices and roughly twice the size of the largest previous full-text index of any kind
  3. LLM Safety Applications: Demonstrated full-text indexing in LLM safety and security use cases, providing safeguards that do not depend on whether a model can be jailbroken
  4. Open-Source Contribution: Provided complete open-source code and performance benchmarks to support future research

Methodology Details

Task Definition

Build a system capable of full-text search on trillion-token-scale LLM training data, supporting:

  • Exact and fuzzy matching
  • Multi-language content search
  • Logical operations and complex queries
  • Real-time search responses
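The first and third requirements can be illustrated with a single Elasticsearch request body. The following is a minimal sketch, not the authors' code: the field name `text` and the example words are assumptions made for illustration. It combines an exact phrase, a fuzzy term, and an exclusion in one `bool` query.

```python
def build_search_body(exact_phrase, fuzzy_term, excluded_term, field="text"):
    """Build a bool query: exact phrase AND NOT excluded term, boosted by a fuzzy term.

    The field name "text" is a hypothetical mapping, not taken from the paper.
    """
    return {
        "query": {
            "bool": {
                # Exact matching: the phrase must appear as written.
                "must": [{"match_phrase": {field: exact_phrase}}],
                # Fuzzy matching: tolerate small spelling differences.
                "should": [
                    {"match": {field: {"query": fuzzy_term, "fuzziness": "AUTO"}}}
                ],
                # Logical NOT: drop documents containing this term.
                "must_not": [{"match": {field: excluded_term}}],
            }
        }
    }


body = build_search_body("open web", "serach", "advertisement")
```

The resulting dictionary can be passed as the body of a search request by any Elasticsearch client.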

System Architecture

1. Data Processing Pipeline

Raw Parquet Files → Stream Processing → Text Analysis → Elasticsearch Index
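The first two stages of this pipeline might look roughly like the sketch below. It assumes each Parquet row carries a `text` column and an optional `id` column, and that the index name is supplied by the caller; none of these names come from the paper. Text analysis itself happens inside Elasticsearch at index time, so the Python side only has to stream rows and wrap them as bulk actions.

```python
from typing import Dict, Iterable, Iterator


def stream_parquet_rows(path: str, batch_size: int = 1000) -> Iterator[Dict]:
    """Stream rows from one Parquet file without loading it whole (needs pyarrow)."""
    import pyarrow.parquet as pq  # imported lazily; only needed for real files

    parquet_file = pq.ParquetFile(path)
    for batch in parquet_file.iter_batches(batch_size=batch_size):
        yield from batch.to_pylist()


def to_bulk_actions(rows: Iterable[Dict], index_name: str) -> Iterator[Dict]:
    """Lazily turn row dicts into Elasticsearch bulk actions, skipping empty text."""
    for row in rows:
        text = row.get("text")
        if not text:
            continue
        action = {"_index": index_name, "_source": {"text": text}}
        if row.get("id") is not None:
            action["_id"] = row["id"]
        yield action


# In-memory rows stand in for a real Parquet file here.
sample_rows = [
    {"id": "a1", "text": "<p>Hello web</p>"},
    {"id": "a2", "text": ""},
    {"text": "no explicit id"},
]
actions = list(to_bulk_actions(sample_rows, "demo-index"))
```

Because both functions are generators, memory use stays bounded by the batch size rather than the file size.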

2. Core Components

  • Elasticsearch Engine: Distributed search and analytics engine
  • Parallel Indexing: Multi-threaded concurrent processing using elasticsearch.helpers.parallel_bulk
  • Text Analyzer: web_content_analyzer performs HTML cleaning, standard tokenization, lowercasing, and ASCII folding
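An analyzer with the four steps named above can be declared in the index settings. This is a sketch under assumptions: the analyzer name `web_content_analyzer` comes from the summary, while the field name `text` and the exact filter order are guesses. `html_strip`, `standard`, `lowercase`, and `asciifolding` are Elasticsearch built-ins.

```python
INDEX_BODY = {
    "settings": {
        "analysis": {
            "analyzer": {
                "web_content_analyzer": {
                    "type": "custom",
                    "char_filter": ["html_strip"],  # HTML cleaning
                    "tokenizer": "standard",  # standard tokenization
                    "filter": ["lowercase", "asciifolding"],  # normalization
                }
            }
        }
    },
    "mappings": {
        "properties": {
            # Hypothetical field name; the paper's mapping may differ.
            "text": {"type": "text", "analyzer": "web_content_analyzer"}
        }
    },
}

analyzer = INDEX_BODY["settings"]["analysis"]["analyzer"]["web_content_analyzer"]
```

This body would be sent once when creating the index, before any documents are indexed.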

3. Key Parameter Tuning

  • Thread Count: Not exceeding CPU core count, balancing concurrency and memory pressure
  • Chunk Size: Determined by formula chunk_size ≤ max_chunk_size / avg_doc_size
  • Maximum Chunk Bytes: Controls maximum payload of bulk requests
  • Queue Size: Buffers imbalances between producer and consumer threads
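These rules of thumb can be turned into concrete `parallel_bulk` arguments. The sketch below assumes that `max_chunk_size` in the formula refers to the maximum chunk size in bytes (the helper's `max_chunk_bytes` argument); the helper functions and the example numbers are illustrative, not values reported in the paper.

```python
import os


def pick_bulk_params(max_chunk_bytes, avg_doc_bytes, requested_threads, queue_size=4):
    """Derive parallel_bulk settings from the rules of thumb listed above."""
    cpu_count = os.cpu_count() or 1
    return {
        # Never use more threads than there are CPU cores.
        "thread_count": max(1, min(requested_threads, cpu_count)),
        # chunk_size <= max_chunk_bytes / avg_doc_size, and at least one document.
        "chunk_size": max(1, max_chunk_bytes // avg_doc_bytes),
        "max_chunk_bytes": max_chunk_bytes,
        "queue_size": queue_size,
    }


def index_actions(client, actions, params):
    """Run parallel_bulk and count failed actions (needs the elasticsearch package)."""
    from elasticsearch.helpers import parallel_bulk  # lazy import

    failed = 0
    for ok, _item in parallel_bulk(client, actions, raise_on_error=False, **params):
        if not ok:
            failed += 1
    return failed


# Example: 100 MiB bulk payloads, documents averaging 4 KiB.
params = pick_bulk_params(
    max_chunk_bytes=100 * 1024 * 1024, avg_doc_bytes=4096, requested_threads=8
)
```

With these inputs the chunk size works out to 25,600 documents per bulk request; `parallel_bulk` is lazy, so its result has to be iterated as in `index_actions` for any indexing to happen.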

Technical Innovations

1. ARM64 Adaptation

  • Construction of OCI-compatible custom container images
  • Resolution of Docker compatibility issues by using Podman as an alternative
  • Re-implementation of orchestration through SLURM job definitions

2. HPC Environment Optimization

  • Disabling memory mapping to accommodate kernel parameter limitations
  • Network configuration bypassing proxies, binding to 127.0.0.1
  • Single-node operation mode adapted to SLURM job isolation
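The three adaptations above map naturally onto standard Elasticsearch settings. The summary does not list the exact configuration used, so the following is only a plausible sketch: `node.store.allow_mmap`, `network.host`, and `discovery.type` are real Elasticsearch settings, but their selection here is an assumption.

```python
HPC_SETTINGS = {
    "node.store.allow_mmap": False,  # avoid mmap when vm.max_map_count cannot be raised
    "network.host": "127.0.0.1",  # bind to the loopback interface only
    "discovery.type": "single-node",  # no cluster formation inside one SLURM job
}


def render_elasticsearch_yml(settings):
    """Render a flat settings dict as elasticsearch.yml lines."""
    lines = []
    for key, value in settings.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        lines.append(f"{key}: {value}")
    return "\n".join(lines) + "\n"


config_text = render_elasticsearch_yml(HPC_SETTINGS)
```

The rendered text could be written to `elasticsearch.yml` inside the container image before the node starts.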

3. Query Optimization

  • match_phrase_query: Phrase matching with a configurable word-distance tolerance (the slop parameter)
  • Multi-level text processing: HTML cleaning → standard tokenization → normalization → ASCII folding
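A phrase query with a word-distance tolerance could be built as follows. The function name mirrors the `match_phrase_query` mentioned above, but its body is an assumption based on Elasticsearch's standard `match_phrase` query and `slop` option; the field name `text` is hypothetical.

```python
def match_phrase_query(phrase, slop=0, field="text"):
    """Build a match_phrase request body.

    slop=0 requires the words to be adjacent and in order; larger values
    allow that many position moves between the query terms.
    """
    return {"query": {"match_phrase": {field: {"query": phrase, "slop": slop}}}}


strict = match_phrase_query("training data")
relaxed = match_phrase_query("training data", slop=2)
```

With `slop=2`, a document containing "training and evaluation data" would still match, whereas the strict variant would not.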

Experimental Setup

Dataset

Apertus Training Data Subset (8.6T tokens, 58% of total training data):

Dataset                            | Tokens (B)
-----------------------------------|-----------
FineWeb-Edu (Score-2)              | 4,815
FineWeb-2-HQ (33% highest quality) | 3,557
StarCoder                          | 235
FineMath CommonCrawl Subset        | 32
Gutenberg and Poison               | 2

Query Datasets

  1. Weaponized Words Dictionary: Harmful vocabulary across 137 languages
  2. LDNOOBW List: Profanity across 28 languages
  3. Chemical Weapons Dataset: 17 dangerous chemical reagent terms

Computing Environment

  • Alps Supercomputer: HPE Cray EX system, 434 PFlops performance
  • Node Configuration: ARM64-based NVIDIA Grace Hopper GH200
  • Storage System: 100PB ClusterStor HDD + 3PB SSD + 1PB VAST

Experimental Results

Indexing Performance

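Dataset             | Data Size (GB) | Time (h) | Indexing Rate (doc/s) | Indexing Overhead Ratio | Peak Memory (GB)
--------------------|----------------|----------|-----------------------|-------------------------|-----------------
FineWeb-2 Edu (EN)  | 12,737         | 143.7    | 10,296                | 1.3                     | 4.9
FineWeb-2 Europe HQ | 2,660          | 408.3    | 589                   | 1.1                     | 7.5
StarCoder           | 229            | 4.2      | 10,919                | 1.4                     | 12.7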

Key Findings:

  • English text is indexed far faster than the multilingual dataset (10,296 vs 589 doc/s)
  • Code data requires more memory resources (12.7GB vs 4.9GB)
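  • Multilingual data is much more expensive to index in wall-clock time (408.3 h for 2,660 GB versus 143.7 h for 12,737 GB of English text), even though its indexing overhead ratio is slightly lower (1.1 vs 1.3)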
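The gap between datasets is easier to see as throughput in GB per hour. The short calculation below takes data sizes and wall-clock times from the indexing table, reading the StarCoder row as 229 GB indexed in 4.2 h; the derived figures are not reported in the paper itself.

```python
# (data size in GB, indexing time in hours), taken from the indexing table.
RUNS = {
    "FineWeb-2 Edu (EN)": (12737, 143.7),
    "FineWeb-2 Europe HQ": (2660, 408.3),
    "StarCoder": (229, 4.2),
}


def gb_per_hour(size_gb, hours):
    """Average indexing throughput over the whole run."""
    return size_gb / hours


throughput = {name: round(gb_per_hour(*run), 1) for name, run in RUNS.items()}

for name, value in throughput.items():
    print(f"{name}: {value} GB/h")
```

On these numbers the English run moves roughly an order of magnitude more data per hour than the multilingual one.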

Query Performance

  • Query time increases linearly with query length
  • Single-word queries: <100ms
  • 300-word queries: ~1000ms
  • System maintains stable performance across various query lengths
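If query time really is linear in query length, the two reported points give a rough way to estimate latency in between. The sketch below treats "<100ms" as an upper bound of 100 ms for one word and "~1000ms" as 1000 ms for 300 words; both anchor values are simplifying assumptions, and the result is only a back-of-the-envelope estimate.

```python
def estimate_latency_ms(words, p1=(1, 100.0), p2=(300, 1000.0)):
    """Linearly interpolate query latency between two (words, ms) points."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)  # about 3 ms per additional word
    return y1 + slope * (words - x1)


for n in (1, 50, 150, 300):
    print(n, round(estimate_latency_ms(n)))
```

Under these assumptions each extra query word costs about 3 ms.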

Harmful Content Analysis

Multilingual Harmful Vocabulary Statistics

Language | Weaponized Words (Million) | LDNOOBW (Million)
---------|----------------------------|------------------
English  | 1,245.8                    | 661.6
French   | 16.8                       | 202.5
German   | 9.9                        | 14.9
Italian  | 1.6                        | 18.5

Common chemical substances (e.g., glycerin, nitric acid) appear with extremely high frequency, while specialized chemical weapons synthesis terminology also shows significant presence in non-English languages, highlighting the importance of multilingual data curation.

Related Work

Existing LLM Data Analysis Tools

  1. Data Portraits: Uses approximate membership inference to reduce computational costs
  2. Statistical Sampling Methods: Such as Luccioni et al.'s analysis of 1% of Common Crawl
  3. Small-Scale Dataset Tools: HuggingFace's Data Measurements, Google's Know Your Data

Large-Scale Indexing Systems

  1. WhatIsInMyBigData: Maximum index of 1.4T tokens (RedPajama)
  2. Infinigram: Uses suffix arrays, supports 4.6T tokens exact search
  3. ROOTS Tools: Fuzzy and exact search for 1.6TB multilingual corpus

Advantages of This Work

  • Scale: 8.6T tokens, surpassing existing Elasticsearch-based systems by 4 times
  • Functionality: Supports fuzzy search and logical operations
  • Multilingual: Covers safety analysis across multiple languages

Conclusions and Discussion

Main Conclusions

  1. Technical Feasibility: Demonstrated the feasibility of deploying Elasticsearch on ARM64 architecture
  2. Scale Achievability: Full-text indexing of trillion-token-scale data is achievable for small teams
  3. Safety Applications: Full-text indexing can be applied to deep safety analysis of LLM training data

Limitations

  1. Coverage Scope: Only indexed 58% of Apertus training data
  2. Architecture Constraints: ARM64 adaptation still faces compatibility challenges
  3. Memory Mapping: Inability to use memory mapping reduces I/O efficiency

Future Directions

  1. Complete Internet Indexing: Building offline search indices for the entire open web
  2. LLM Fact Grounding: Verification of LLM-generated content based on offline search
  3. Economic and Ethical Issues: Fair compensation mechanisms for content creators

In-Depth Evaluation

Strengths

  1. High Practical Value: Addresses the critical issue of LLM training data transparency
  2. Significant Technical Contributions: First implementation of trillion-token-scale Elasticsearch indexing
  3. Open-Source Friendly: Provides complete code and detailed deployment guidelines
  4. Clear Safety Applications: Demonstrates concrete LLM safety use cases
  5. Environmentally Conscious: Uses energy-efficient ARM64 architecture, with emissions of only 90 kg CO2eq

Weaknesses

  1. Incomplete Data Coverage: Did not index all training data
  2. ARM64 Challenges: Complex technical adaptation process may limit adoption
  3. Performance Trade-offs: Sacrificed some I/O performance to accommodate HPC environment
  4. Limited Depth of Safety Analysis: Relatively surface-level analysis of harmful content

Impact

  1. Academic Contribution: Provides new technical pathways for LLM training data analysis
  2. Practical Value: Directly applicable to LLM safety audits
  3. Technology Promotion: Promotes ARM64 adoption in enterprise applications
  4. Policy Support: Provides technical tools for LLM regulation

Applicable Scenarios

  1. LLM Development Teams: Training data quality control and safety audits
  2. Research Institutions: Large-scale text data analysis and mining
  3. Regulatory Bodies: LLM compliance checking and risk assessment
  4. Enterprise Applications: Content filtering and data governance

References

The paper cites 60 relevant references covering multiple domains including LLM training, data security, and full-text search, providing a solid theoretical foundation for the research.


Overall Assessment: This is a technically valuable paper that successfully addresses critical issues in LLM training data transparency and safety analysis. While facing certain limitations in data coverage and technical adaptation, its pioneering work provides important technical foundations and practical guidance for the field.