Getting Your Indices in a Row: Full-Text Search for LLM Training Data for Real World
Marinas, Kucherenko, Sternfeld et al.
The performance of Large Language Models (LLMs) is determined by their training data. Despite the proliferation of open-weight LLMs, access to LLM training data remains limited. Even for fully open LLMs, the sheer scale of the data makes it all but inscrutable to the general scientific community, even though it may contain critical data scraped from the internet.
In this paper, we present the full-text indexing pipeline for the Apertus LLM training data. Leveraging Elasticsearch parallel indices and the Alps infrastructure, a state-of-the-art, highly energy-efficient arm64 supercluster, we indexed 8.6T of the 15.2T tokens used to train the Apertus LLM family, creating both a critical LLM safety tool and, effectively, an offline, curated, open web search engine. Our contribution is threefold. First, we demonstrate that Elasticsearch can be successfully ported to next-generation arm64-based infrastructure. Second, we demonstrate that full-text indexing at the scale of modern LLM training datasets, and of the entire open web, is feasible and accessible. Finally, we demonstrate that such indices can be used to ensure jailbreak-agnostic LLM safety, which was previously inaccessible.
We hope that our findings will be useful to other teams attempting large-scale data indexing and facilitate the general transition towards greener computation.
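The performance of large language models (LLMs) is determined by their training data. Although open-weight LLMs are proliferating, access to LLM training data remains limited. Even for fully open LLMs, the scale of the data makes in-depth analysis difficult for the general scientific community. This paper presents the full-text indexing pipeline for the Apertus LLM training data. Using Elasticsearch parallel indexing and the Alps infrastructure (a state-of-the-art, highly efficient arm64 supercomputer), we successfully indexed 8.6T of the 15.2T tokens used to train the Apertus LLM family, producing both an important LLM safety tool and an offline, curated, open web search engine.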
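To make the idea of parallel indices concrete, the following is a minimal sketch and not the pipeline described in the paper. It assumes documents are routed to one of several independent Elasticsearch indices by hashing a document ID, and that each group is written through the newline-delimited `_bulk` request format. The index prefix `apertus-shard`, the count of four indices, the `id` and `text` field names, and the helpers `index_for`, `bulk_bodies`, and `coverage` are all hypothetical and chosen only for illustration.

```python
import hashlib
import json
from collections import defaultdict

NUM_INDICES = 4                 # assumption: illustrative value, not from the paper
INDEX_PREFIX = "apertus-shard"  # hypothetical index naming scheme


def index_for(doc_id: str, num_indices: int = NUM_INDICES) -> str:
    """Deterministically map a document ID to one of the parallel indices."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % num_indices
    return f"{INDEX_PREFIX}-{bucket:03d}"


def bulk_bodies(docs, num_indices: int = NUM_INDICES) -> dict:
    """Group documents by target index and render one NDJSON _bulk body per index.

    Each document contributes two lines: an action line naming the index
    and ID, then the document source. The body ends with a newline, as the
    Elasticsearch _bulk format requires.
    """
    lines = defaultdict(list)
    for doc in docs:
        target = index_for(doc["id"], num_indices)
        action = {"index": {"_index": target, "_id": doc["id"]}}
        lines[target].append(json.dumps(action))
        lines[target].append(json.dumps({"text": doc["text"]}, ensure_ascii=False))
    return {name: "\n".join(body) + "\n" for name, body in lines.items()}


def coverage(indexed_tokens: float, total_tokens: float) -> float:
    """Fraction of the training corpus covered by the index."""
    return indexed_tokens / total_tokens


if __name__ == "__main__":
    sample = [{"id": f"doc-{i}", "text": f"sample text {i}"} for i in range(8)]
    for name, body in sorted(bulk_bodies(sample).items()):
        print(name, body.count("\n") // 2, "documents")
    # Token counts taken from the abstract: 8.6T indexed out of 15.2T.
    print(f"coverage: {coverage(8.6, 15.2):.1%}")
```

Because each bulk body targets a single index, the bodies can in principle be sent by independent workers without coordination, which is the property a parallel-index layout relies on. Using the figures from the abstract, the indexed portion is roughly 57% of the training tokens.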