We present BabyBabelLM, a multilingual collection of datasets modeling the language a person observes from birth until they acquire a native language. We curate developmentally plausible pretraining data aiming to cover the equivalent of 100M English words of content in each of 45 languages. We compile evaluation suites and train baseline models in each language. BabyBabelLM aims to facilitate multilingual pretraining and cognitive modeling.
- Paper ID: 2510.10159
- Title: BabyBabelLM: A Multilingual Benchmark of Developmentally Plausible Training Data
- Authors: Jaap Jumelet, Abdellah Fourtassi, Akari Haga, Bastian Bunzeck, and 27 other authors
- Classification: cs.CL (Computation and Language)
- Submission Date: October 11, 2025 (arXiv)
- Paper Link: https://arxiv.org/abs/2510.10159
This paper introduces BabyBabelLM, a collection of multilingual datasets designed to simulate the linguistic environment humans encounter from birth through native language acquisition. The researchers carefully curated developmentally plausible pretraining data, targeting approximately 100 million English-equivalent words for each of 45 languages. An evaluation suite was compiled and baseline models were trained for each language. BabyBabelLM aims to facilitate research in multilingual pretraining and cognitive modeling.
Current language model research primarily focuses on scaling, pursuing larger models and more training data. However, this trend overlooks fundamental questions about language learning. Humans acquire linguistic competence from infancy to adulthood through exposure to fewer than 100 million words, which contrasts sharply with modern language models trained on trillions of tokens, a difference of several orders of magnitude.
- Data Efficiency: Exploring efficient language modeling under limited data budgets
- Developmental Plausibility: Investigating training data composition aligned with human language acquisition processes
- Multilingual Coverage: Extending the scope of the BabyLM challenge from English to multilingual settings
- Cognitive Modeling: Providing resources for understanding relationships between human language acquisition and language model learning
- BabyLM challenge limited to English, lacking cross-linguistic validation
- Absence of systematic multilingual developmentally plausible datasets
- Existing research conducted in isolation, lacking coordinated data collection standards
- Uneven distribution of evaluation resources across languages
- Constructed developmentally plausible pretraining datasets covering 45 languages, organized into three tiers (100M, 10M, 1M English-equivalent words)
- Provided open-source data expansion pipeline supporting community contributions of new languages and dataset extensions
- Compiled comprehensive multilingual evaluation suite covering formal and functional linguistic abilities
- Trained 45 monolingual models, 7 bilingual models, and 1 multilingual model as baselines
- Established community-driven collaborative framework promoting continuous dataset expansion and improvement
- Child-Directed Speech (CDS): Transcriptions of adult speech to children
- Educational Materials: Textbooks and exam content designed for children
- Children's Media: Children's books, children's wikis, children's news
- Subtitle Content: Subtitles from child-appropriate films and television programs
- Exclusion of Synthetic Data: Avoiding artificially generated content such as TinyStories
Data collection for each language is led by researchers familiar with that language, ensuring data quality and cultural appropriateness.
- Transcription Data
  - Child-directed speech: Caregiver-child interactions from the CHILDES database
  - Child-accessible speech: Adult conversations children may incidentally overhear
- Educational Content
  - Textbooks and exam materials designed for children
  - Provide direct instruction, complementing CDS with more formal language patterns
- Books, Wiki, News
  - Children's books, children's wiki articles, children's news
  - Contain more complex sentence structures and more diverse vocabulary
- Subtitles
  - Subtitles from child-appropriate films and television programs
  - Educational content subtitles from the QED corpus
- Padding Data
  - OpenSubtitles corpus (filtered for inappropriate content)
  - FineWeb-C and Wikipedia data as fallback
- Tier 1: 9 languages, approximately 100 million English-equivalent words
- Tier 2: 15 languages, approximately 10 million English-equivalent words
- Tier 3: 21 languages, approximately 1 million English-equivalent words
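
The tier budgets above are stated in English-equivalent words, which the benchmark calibrates by UTF-8 encoding size. The following is a minimal sketch of how such a conversion could work, assuming the calibration compares byte counts of parallel text. The function names, the toy sentence pair, and the `bytes_per_english_word` constant are illustrative assumptions, not values or code from the paper.

```python
def utf8_len(text: str) -> int:
    """Size of the text in bytes when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def byte_premium(target_parallel: str, english_parallel: str) -> float:
    """Ratio of UTF-8 bytes the target language needs, relative to English,
    to express the same (parallel) content."""
    return utf8_len(target_parallel) / utf8_len(english_parallel)


def byte_budget(english_words: int, premium: float,
                bytes_per_english_word: float = 6.0) -> int:
    """Byte budget for a target language that matches `english_words` words
    of English content. `bytes_per_english_word` is an assumed placeholder."""
    return round(english_words * bytes_per_english_word * premium)


# Toy parallel sentence pair (illustrative only, far too small for a real estimate).
english = "The cat sleeps."
greek = "Η γάτα κοιμάται."

premium = byte_premium(greek, english)   # > 1: Greek letters take 2 bytes each in UTF-8
budget = byte_budget(1_000_000, premium)  # byte budget for a 1M-word-equivalent tier
print(f"byte premium: {premium:.2f}, byte budget: {budget:,}")
```

In practice the ratio would be estimated from a large parallel corpus rather than a single sentence; the point of the sketch is only that a fixed word count means different amounts of raw text in different scripts.
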
Initial processing is conducted by language leads according to language-specific and data-specific requirements.
- Normalization: Unicode, whitespace, and punctuation standardization
- Category-Specific Processing:
  - Dialogue transcripts: Removal of linguistic annotations
  - Subtitle data: Removal of speaker labels, musical symbols, stage directions
  - Book formats: Removal of XML tags and URLs
- Language Validation: Language identification and verification using GlotLID v3
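
The cleaning steps listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual pipeline: the regular expressions, the choice of NFC normalization, and the function names are all hypothetical, and language identification with GlotLID is left out.

```python
import re
import unicodedata

# Hypothetical patterns; a real pipeline would tune these per language and source.
XML_TAG = re.compile(r"<[^>]+>")
URL = re.compile(r"https?://\S+")
SPEAKER_LABEL = re.compile(r"^\s*[A-Z][A-Z0-9 _.-]{0,30}:\s*")  # e.g. "ANNA: "
BRACKETED = re.compile(r"\[[^\]]*\]|\([^)]*\)")                  # stage directions
MUSIC = re.compile(r"[♪♫]+")
WHITESPACE = re.compile(r"\s+")


def normalize(line: str) -> str:
    """Unicode normalization (NFC assumed) plus whitespace collapsing."""
    line = unicodedata.normalize("NFC", line)
    return WHITESPACE.sub(" ", line).strip()


def clean_subtitle(line: str) -> str:
    """Drop speaker labels, bracketed stage directions, and musical symbols."""
    line = SPEAKER_LABEL.sub("", line)
    line = BRACKETED.sub("", line)
    line = MUSIC.sub("", line)
    return normalize(line)


def clean_book(line: str) -> str:
    """Drop XML tags and URLs from book-like sources."""
    line = XML_TAG.sub("", line)
    line = URL.sub("", line)
    return normalize(line)


print(clean_subtitle("ANNA: [door slams] Where are you? ♪"))
print(clean_book("<p>Read more at https://example.org/page now.</p>"))
```

Keeping one small function per data category mirrors the category-specific processing described above and makes it easy to add a new source type without touching the shared normalization step.
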
- Monolingual Models: GPT-2 architecture, 4 transformer layers, 8 attention heads, hidden dimension 512
- Bilingual Models: Combining target language and English data (200M words total)
- Multilingual Model: 12 layers, hidden dimension 768, vocabulary size 32,768, 111M parameters
- Vocabulary Size: 8,192 (monolingual), 32,768 (multilingual)
- Training Strategy: BPE tokenization, 10 epochs (monolingual), 5 epochs (bilingual), 1 epoch (multilingual)
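
As a sanity check on the configurations above, the parameter count of a GPT-2-style model can be estimated from the layer count, hidden dimension, and vocabulary size. The sketch below assumes the standard GPT-2 layout (4x MLP expansion, learned position embeddings, tied input/output embeddings) and a context length of 1,024 positions; the context length and the function name are assumptions, since the summary does not state them.

```python
def gpt2_param_count(n_layer: int, d_model: int, vocab_size: int,
                     n_positions: int = 1024, tie_embeddings: bool = True) -> int:
    """Approximate parameter count of a GPT-2-style decoder."""
    # Attention: fused QKV projection (3d^2 + 3d) and output projection (d^2 + d).
    attn = 4 * d_model * d_model + 4 * d_model
    # MLP: d -> 4d (4d^2 + 4d) and 4d -> d (4d^2 + d).
    mlp = 8 * d_model * d_model + 5 * d_model
    # Two LayerNorms per block, each with weight and bias.
    norms = 4 * d_model
    per_layer = attn + mlp + norms

    embeddings = vocab_size * d_model + n_positions * d_model
    final_norm = 2 * d_model
    total = n_layer * per_layer + embeddings + final_norm
    if not tie_embeddings:
        total += vocab_size * d_model  # separate output projection
    return total


multilingual = gpt2_param_count(n_layer=12, d_model=768, vocab_size=32_768)
monolingual = gpt2_param_count(n_layer=4, d_model=512, vocab_size=8_192)
print(f"multilingual: {multilingual / 1e6:.1f}M, monolingual: {monolingual / 1e6:.1f}M")
```

Under these assumptions the multilingual configuration comes out at roughly 111M parameters, in line with the figure listed above, while the monolingual configuration is several times smaller.
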
- MonoBLiMP: Language-specific minimal contrast benchmarks
- MultiBLiMP: Large-scale minimal contrast dataset based on Universal Dependencies
- CLAMS: Cross-lingual subject-verb agreement benchmark
- Knowledge-Intensive Tasks: Global-MMLU, INCLUDE, BM-LAMA
- Reasoning Tasks: XNLI, HellaSwag, Belebele, ARC, XCOPA, etc.
- Zero-Shot Evaluation: Minimal contrast comparison based on model output probabilities
- Fine-Tuning Evaluation: Classification and question-answering tasks, up to 8,000 training samples, 10 epochs
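
The zero-shot protocol can be illustrated with a small sketch: a model counts as correct on a minimal pair when it assigns a higher score to the grammatical sentence than to the ungrammatical one. The toy add-one-smoothed bigram scorer below is only a stand-in for a trained language model, and all names and example sentences are hypothetical.

```python
import math
from collections import Counter
from typing import Callable, Iterable, List, Tuple


def minimal_pair_accuracy(pairs: Iterable[Tuple[str, str]],
                          score: Callable[[str], float]) -> float:
    """Fraction of (grammatical, ungrammatical) pairs where the grammatical
    sentence receives the strictly higher score (e.g. log-probability)."""
    pairs = list(pairs)
    correct = sum(score(good) > score(bad) for good, bad in pairs)
    return correct / len(pairs)


def make_bigram_scorer(corpus: List[str]) -> Callable[[str], float]:
    """Toy add-one-smoothed bigram model standing in for a trained LM."""
    sents = [["<s>"] + s.lower().split() for s in corpus]
    unigrams = Counter(tok for toks in sents for tok in toks)
    bigrams = Counter(bg for toks in sents for bg in zip(toks, toks[1:]))
    vocab = len(unigrams) + 1  # one extra slot for unseen tokens

    def score(sentence: str) -> float:
        toks = ["<s>"] + sentence.lower().split()
        return sum(math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
                   for a, b in zip(toks, toks[1:]))

    return score


corpus = ["the dog barks", "the dogs bark", "the cat sleeps", "the cats sleep"]
pairs = [("the dog barks", "the dog bark"),
         ("the cats sleep", "the cats sleeps")]

score = make_bigram_scorer(corpus)
accuracy = minimal_pair_accuracy(pairs, score)
print(f"minimal-pair accuracy: {accuracy:.2f}")
```

With a neural model, `score` would instead sum the token log-probabilities the model assigns to each sentence; the accuracy computation stays the same.
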
- Baseline: Random-chance performance
- Comparative Models: Qwen3-0.6B (appropriately scaled multilingual model)
- Architecture Comparison: GPT-BERT vs GPT-2
- MultiBLiMP Tasks: Tier 1 languages typically exceed 80% accuracy, demonstrating strong grammatical learning ability
- Other Benchmarks: Performance on most tasks is close to chance, reflecting data scale limitations
- Data Scale Impact: Tier 1 > Tier 2 > Tier 3, demonstrating the importance of data quantity for performance
- MultiBLiMP: Monolingual models typically outperform multilingual models, except for 4 Tier 3 languages
- Belebele: Both model types approach random performance, while Qwen performs significantly better
- Overall Trend: Qwen surpasses the proposed models on most tasks, but the multilingual model outperforms Qwen on 8 languages
- Knowledge-Intensive Tasks: SIB-200, BM-LAMA, XCOMPS, and INCLUDE show consistent performance improvements with bilingual training
- Grammatical Tasks: MultiBLiMP performance remains essentially unchanged, indicating syntactic ability is less sensitive to bilingual input
- Special Cases: Dutch shows a slight decline on INCLUDE, possibly due to domain mismatch
- GPT-2 models consistently outperform GPT-BERT on SIB-200 and MultiBLiMP tasks
- Results indicate GPT-2 architecture is better suited for small-scale data training in the current configuration
- Tier 1 Languages: Chinese, French, Bulgarian, etc., with relatively abundant developmentally plausible data
- Tier 2 Languages: Japanese, Serbian, Cantonese, etc., with moderate data quantities
- Tier 3 Languages: Mostly low-resource languages, primarily relying on multilingual resource padding
- First Edition: 10M and 100M word English corpus, 39% developmentally plausible data
- Second Edition: Increased to 70% child-directed data
- Evaluation Methods: Zero-shot minimal contrast and fine-tuning evaluation
- Salhan et al. (2024): Curriculum learning for French, German, Japanese, and Chinese acquisition
- Prévot et al. (2024): Spontaneous speech corpus research for English and French
- Matzopoulos et al. (2025): BabyLM research for isiXhosa, highlighting low-resource language challenges
- CHILDES: Child-adult interaction database for 40+ languages
- MAO-CHILDES: Age-ordered dataset for 5 languages
- IPA-CHILDES: Phonemicized corpus for 31 languages
- Feasibility Validation: Successfully constructed developmentally plausible datasets for 45 languages, demonstrating the feasibility of multilingual BabyLM research
- Data Quantity Impact: More developmentally plausible data improves grammatical learning ability, particularly on MultiBLiMP tasks
- Bilingual Benefits: Consistent performance improvements on knowledge-intensive tasks with bilingual training
- Architecture Selection: GPT-2 architecture outperforms GPT-BERT under small-scale data settings
- Uneven Language Coverage: Despite covering 45 languages, African languages and minority languages remain underrepresented
- Data Composition Variance: Significant differences in developmental plausibility ratios across languages may affect cross-linguistic comparisons
- Evaluation Resource Constraints: Lack of standardized evaluation benchmarks covering all languages
- Data Approximation: Datasets represent only rough approximations of actual child language input
- Expand Language Coverage: Particularly African languages and other low-resource languages
- Improve Data Quality: Collect more high-quality child-directed speech data
- Standardize Evaluation: Develop cross-linguistically consistent evaluation frameworks
- Multilingual Ability Research: Investigate bilingual and multilingual acquisition mechanisms in depth
- Systematic Contribution: First systematic construction of large-scale multilingual developmentally plausible datasets
- Community-Oriented: Established sustainable community-driven data collection framework
- Methodological Rigor: Employed byte-equivalent calibration ensuring cross-linguistic data quantity comparability
- Strong Openness: Complete release of data, code, and models promoting reproducible research
- High Practical Value: Provides important resources for multilingual cognitive modeling and data efficiency research
- Inconsistent Data Quality: Significant variation in developmental plausibility ratios across languages
- Limited Model Performance: Baseline models approach random performance on most tasks
- Incomplete Evaluation Coverage: Some languages lack sufficient evaluation benchmarks
- Insufficient Theoretical Analysis: Lacks in-depth analysis of why certain languages or tasks perform better
- Field Contribution: Fills gap in multilingual developmentally plausible datasets, advancing related research
- Practical Value: Provides important starting point for low-resource language model research
- Reproducibility: Complete open-source resources ensure research reproducibility and scalability
- Community Building: Establishes sustainable collaborative framework promoting long-term development
- Cognitive Linguistics Research: Exploring relationships between human language acquisition and machine learning
- Low-Resource Language Modeling: Providing training starting points for resource-scarce languages
- Multilingual Education: Supporting bilingual and multilingual learning research
- Data Efficiency Research: Investigating model training strategies under limited data budgets
- Byte-Equivalent Calibration: Adjusting data quantities across languages using UTF-8 encoding size, ensuring fair comparison
- Hierarchical Data Organization: Stratifying languages into three tiers based on available data, balancing coverage and quality
- Community-Driven Quality Control: Each language managed by native or fluent speakers, ensuring cultural and linguistic appropriateness
- Dual-Mode Evaluation: Combining zero-shot and fine-tuning evaluation for comprehensive ability assessment
- Cross-Linguistic Consistency: Using tools like MultiBLiMP to ensure cross-linguistic evaluation comparability
- Capability-Stratified Evaluation: Distinguishing between formal and functional linguistic ability assessment
- Complete Resource Release: Data, code, and models all open-sourced
- Scalable Design: Standardized pipeline supporting community contributions
- Transparent Documentation: Detailed information on data sources, licenses, and preprocessing
This work makes important contributions at the intersection of multilingual language model research and cognitive linguistics, establishing a sustainable research platform with the potential to advance understanding of human language acquisition mechanisms.