Getting Your Indices in a Row: Full-Text Search for LLM Training Data for Real World
Marinas, Kucherenko, Sternfeld et al.
The performance of Large Language Models (LLMs) is determined by their training data. Despite the proliferation of open-weight LLMs, access to LLM training data has remained limited. Even for fully open LLMs, the scale of the data makes it all but inscrutable to the general scientific community, despite potentially containing critical data scraped from the internet.
In this paper, we present the full-text indexing pipeline for the Apertus LLM training data. Leveraging Elasticsearch parallel indices and the Alps infrastructure, a state-of-the-art, highly energy-efficient arm64 supercluster, we were able to index 8.6T tokens out of 15.2T used to train the Apertus LLM family, creating both a critical LLM safety tool and effectively an offline, curated, open web search engine. Our contribution is threefold. First, we demonstrate that Elasticsearch can be successfully ported onto next-generation arm64-based infrastructure. Second, we demonstrate that full-text indexing at the scale of modern LLM training datasets and the entire open web is feasible and accessible. Finally, we demonstrate that such indices can be used to ensure previously inaccessible jailbreak-agnostic LLM safety.
We hope that our findings will be useful to other teams attempting large-scale data indexing and facilitate the general transition towards greener computation.
The performance of Large Language Models (LLMs) depends critically on their training data. Despite the increasing availability of open-weight LLMs, access to LLM training data remains restricted. Even for fully open LLMs, the sheer scale of the data makes in-depth analysis challenging for the broader scientific community, even though the data may contain critical content scraped from the internet. This paper presents a full-text indexing pipeline for the Apertus LLM training data. Using Elasticsearch parallel indices and the Alps infrastructure (a state-of-the-art, highly energy-efficient arm64 supercluster), the authors indexed 8.6T of the 15.2T tokens used to train the Apertus LLM family, creating both a critical LLM safety tool and, in effect, an offline, curated open web search engine.
Lack of Training Data Transparency: Despite the proliferation of open-weight LLMs, training data remains difficult to access and analyze
Data Scale Challenges: Modern LLM training data is massive (trillions of tokens), making systematic inspection nearly impossible
Safety Risks: Training data may contain harmful content, including personal information, copyrighted material, toxic language, and even dangerous information
ARM64 Architecture Migration: First demonstration of successful Elasticsearch deployment on ARM64-based GH200 HPC systems
Large-Scale Indexing Implementation: Indexed an 8.6T-token dataset, reported as 4 times larger than previous Elasticsearch-based indices and 2 times larger than previous full-text indices overall
LLM Safety Applications: Demonstrated applications of full-text indexing to LLM safety and security, providing jailbreak-agnostic safeguards that operate on the training data itself rather than on model outputs
Open-Source Contribution: Provided complete open-source code and performance benchmarks to support future research
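The idea of parallel indices mentioned above can be illustrated with a minimal sketch. It is not the authors' pipeline: the index prefix `apertus-shard`, the number of indices, the hash-based routing, and the document fields `id` and `text` are all assumptions made for illustration. The only part taken from Elasticsearch itself is the newline-delimited layout of a `_bulk` request body (an action line followed by a source line, with a trailing newline).

```python
import json
import zlib
from collections import defaultdict

NUM_INDICES = 4                  # assumed number of parallel indices
INDEX_PREFIX = "apertus-shard"   # hypothetical index name prefix


def route(doc_id, num_indices=NUM_INDICES):
    """Deterministically assign a document to one of the parallel indices."""
    bucket = zlib.crc32(doc_id.encode("utf-8")) % num_indices
    return f"{INDEX_PREFIX}-{bucket:03d}"


def build_bulk_bodies(docs, num_indices=NUM_INDICES):
    """Group documents by target index and build one NDJSON body per index.

    Each document contributes two lines: an action line naming the index
    and document id, then the document source itself.
    """
    lines = defaultdict(list)
    for doc in docs:
        index_name = route(doc["id"], num_indices)
        action = {"index": {"_index": index_name, "_id": doc["id"]}}
        lines[index_name].append(json.dumps(action))
        lines[index_name].append(json.dumps({"text": doc["text"]}, ensure_ascii=False))
    # A bulk body must end with a newline.
    return {name: "\n".join(body) + "\n" for name, body in lines.items()}


if __name__ == "__main__":
    sample = [
        {"id": "doc-1", "text": "Glycerin is a common chemical substance."},
        {"id": "doc-2", "text": "Nitric acid appears in many web pages."},
    ]
    for index_name, body in build_bulk_bodies(sample).items():
        print(index_name)
        print(body, end="")
```

Because routing depends only on the document id, each body can be sent to its index by an independent worker, which is one plausible way to parallelize ingestion across nodes.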
Common chemical substances (e.g., glycerin, nitric acid) appear with extremely high frequency, while specialized chemical weapons synthesis terminology also shows significant presence in non-English languages, highlighting the importance of multilingual data curation.
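A frequency analysis of this kind could be expressed as phrase-count queries restricted by language. The sketch below only builds the request body in Elasticsearch's query DSL (a `bool` query with a `match_phrase` clause and an optional `term` filter); the field names `text` and `language` are assumptions, since the summary does not describe the actual index mapping.

```python
import json


def build_count_query(phrase, language=None,
                      text_field="text", language_field="language"):
    """Build a query body that matches documents containing an exact phrase.

    text_field and language_field are hypothetical field names; adjust them
    to whatever mapping the index really uses.
    """
    bool_query = {"must": [{"match_phrase": {text_field: phrase}}]}
    if language is not None:
        # Filters restrict the result set without affecting relevance scoring.
        bool_query["filter"] = [{"term": {language_field: language}}]
    return {"query": {"bool": bool_query}}


if __name__ == "__main__":
    for phrase in ("glycerin", "nitric acid"):
        print(json.dumps(build_count_query(phrase, language="en")))
```

A body like this can be posted to an index's `_count` endpoint to obtain the number of matching documents; repeating the request per language gives a per-language breakdown of how often a term appears.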
The paper cites 60 relevant references covering multiple domains including LLM training, data security, and full-text search, providing a solid theoretical foundation for the research.
Overall Assessment: This is a technically valuable paper that successfully addresses critical issues in LLM training data transparency and safety analysis. While facing certain limitations in data coverage and technical adaptation, its pioneering work provides important technical foundations and practical guidance for the field.