BabyBabelLM: A Multilingual Benchmark of Developmentally Plausible Training Data

Jumelet, Fourtassi, Haga et al.
We present BabyBabelLM, a multilingual collection of datasets modeling the language a person observes from birth until they acquire a native language. We curate developmentally plausible pretraining data aiming to cover the equivalent of 100M English words of content in each of 45 languages. We compile evaluation suites and train baseline models in each language. BabyBabelLM aims to facilitate multilingual pretraining and cognitive modeling.

Basic Information

  • Paper ID: 2510.10159
  • Title: BabyBabelLM: A Multilingual Benchmark of Developmentally Plausible Training Data
  • Authors: Jaap Jumelet, Abdellah Fourtassi, Akari Haga, Bastian Bunzeck, and 27 other authors
  • Classification: cs.CL (Computation and Language)
  • Submission Date: October 11, 2025 (arXiv)
  • Paper Link: https://arxiv.org/abs/2510.10159

Abstract

This paper introduces BabyBabelLM, a collection of multilingual datasets designed to simulate the linguistic environment humans encounter from birth through native language acquisition. The researchers carefully curated developmentally plausible pretraining data, targeting approximately 100 million English-equivalent words for each of 45 languages. An evaluation suite was compiled and baseline models were trained for each language. BabyBabelLM aims to facilitate research in multilingual pretraining and cognitive modeling.

Research Background and Motivation

Problem Definition

Current language model research focuses primarily on scaling: larger models and more training data. This trend sidesteps fundamental questions about language learning. Humans acquire linguistic competence between infancy and adulthood from fewer than 100 million words, whereas modern language models are trained on trillions of tokens, a gap of several orders of magnitude.

Research Motivation

  1. Data Efficiency: Exploring efficient language modeling under limited data budgets
  2. Developmental Plausibility: Investigating training data composition aligned with human language acquisition processes
  3. Multilingual Coverage: Extending the scope of the BabyLM challenge from English to multilingual settings
  4. Cognitive Modeling: Providing resources for understanding relationships between human language acquisition and language model learning

Limitations of Existing Approaches

  • BabyLM challenge limited to English, lacking cross-linguistic validation
  • Absence of systematic multilingual developmentally plausible datasets
  • Existing research conducted in isolation, lacking coordinated data collection standards
  • Uneven distribution of evaluation resources across languages

Core Contributions

  1. Constructed developmentally plausible pretraining datasets covering 45 languages, organized into three tiers (100M, 10M, 1M English-equivalent words)
  2. Provided open-source data expansion pipeline supporting community contributions of new languages and dataset extensions
  3. Compiled comprehensive multilingual evaluation suite covering formal and functional linguistic abilities
  4. Trained 45 monolingual models, 7 bilingual models, and 1 multilingual model as baselines
  5. Established community-driven collaborative framework promoting continuous dataset expansion and improvement

Methodology

Data Collection Principles

Developmental Plausibility Criteria

  • Child-Directed Speech (CDS): Transcriptions of adult speech to children
  • Educational Materials: Textbooks and exam content designed for children
  • Children's Media: Children's books, children's wikis, children's news
  • Subtitle Content: Subtitles from child-appropriate films and television programs
  • Exclusion of Synthetic Data: Avoiding artificially generated content such as TinyStories

Community-Driven Data Leadership

Data collection for each language is led by researchers familiar with that language, ensuring data quality and cultural appropriateness.

Dataset Composition

Data Categories

  1. Transcription Data
    • Child-directed speech: Caregiver-child interactions from the CHILDES database
    • Child-accessible speech: Adult conversations children may incidentally overhear
  2. Educational Content
    • Textbooks and exam materials designed for children
    • Providing direct instruction and complementing CDS with more formal language patterns
  3. Books, Wiki, News
    • Children's books, children's wiki articles, children's news
    • Containing more complex sentence structures and diverse vocabulary
  4. Subtitles
    • Subtitles from child-appropriate films and television programs
    • Educational content subtitles from the QED corpus
  5. Padding Data
    • OpenSubtitles corpus (filtered for inappropriate content)
    • FineWeb-C and Wikipedia data as fallback

Language Stratification

  • Tier 1: 9 languages, approximately 100 million English-equivalent words
  • Tier 2: 15 languages, approximately 10 million English-equivalent words
  • Tier 3: 21 languages, approximately 1 million English-equivalent words

Data Preprocessing

Language-Specific Preprocessing

Language leads carry out the initial processing, following the requirements of each language and data source.

Unified Processing Pipeline

  1. Normalization: Unicode, whitespace, and punctuation standardization
  2. Category-Specific Processing:
    • Dialogue transcripts: Removal of linguistic annotations
    • Subtitle data: Removal of speaker labels, musical symbols, stage directions
    • Book formats: Removal of XML tags and URLs
  3. Language Validation: Language identification and verification using GlotLID v3
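
The normalization and category-specific steps above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual pipeline: the regular expressions, the quote-mapping table, and the function names `normalize`, `clean_subtitle`, and `clean_book` are all assumptions made for the example. The GlotLID language-validation step is left out because it depends on an external model.

```python
import re
import unicodedata

# All patterns below are illustrative assumptions, not the project's rules.
URL_RE = re.compile(r"https?://\S+")
TAG_RE = re.compile(r"<[^>]+>")
SPEAKER_RE = re.compile(r"^\s*[A-Z][A-Z0-9 _]{0,20}:\s*")  # e.g. "NARRATOR: "
STAGE_RE = re.compile(r"\[[^\]]*\]|\([^)]*\)")             # e.g. "[door opens]"
MUSIC_RE = re.compile(r"[\u266a\u266b]")                   # musical note symbols
WS_RE = re.compile(r"\s+")

# Hypothetical punctuation standardization: curly quotes to straight quotes.
QUOTES = str.maketrans({"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"})


def normalize(text: str) -> str:
    """Unicode NFC normalization, quote standardization, whitespace collapse."""
    text = unicodedata.normalize("NFC", text)
    text = text.translate(QUOTES)
    return WS_RE.sub(" ", text).strip()


def clean_subtitle(line: str) -> str:
    """Drop musical symbols, bracketed stage directions, and a leading speaker label."""
    line = MUSIC_RE.sub("", line)
    line = STAGE_RE.sub("", line)
    line = SPEAKER_RE.sub("", line)
    return normalize(line)


def clean_book(text: str) -> str:
    """Drop XML/HTML-style tags and URLs, then normalize."""
    text = TAG_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    return normalize(text)


print(clean_subtitle("\u266a NARRATOR: [door opens]  Hello   there! \u266a"))
print(clean_book("<p>See https://example.org/x for more.</p>"))
```

Tags are stripped before URLs so that a closing tag directly after a link is not swallowed by the URL pattern. A real pipeline would likely need per-language exceptions, for example where parentheses carry actual content rather than stage directions.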

Experimental Setup

Model Configuration

  • Monolingual Models: GPT-2 architecture, 4 transformer layers, 8 attention heads, hidden dimension 512
  • Bilingual Models: Combining target language and English data (200M words total)
  • Multilingual Model: 12 layers, hidden dimension 768, vocabulary size 32,768, 111M parameters
  • Vocabulary Size: 8,192 (monolingual), 32,768 (multilingual)
  • Training Strategy: BPE tokenization, 10 epochs (monolingual), 5 epochs (bilingual), 1 epoch (multilingual)
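
As a sanity check on the figures above, the parameter count of a GPT-2-style decoder can be worked out from its dimensions. The sketch below assumes the standard GPT-2 layout (tied input/output embeddings, learned position embeddings, a 4x feed-forward expansion, biases on all linear layers) and a context length of 1,024 positions. The context length is not stated in this summary, so it is an assumption, as is the helper name `gpt2_param_count`. Under those assumptions the multilingual configuration comes out at roughly 111M parameters, consistent with the number quoted above.

```python
def gpt2_param_count(n_layers: int, d_model: int, vocab_size: int,
                     n_positions: int = 1024) -> int:
    """Count parameters of a GPT-2-style decoder with tied embeddings.

    n_positions=1024 is an assumed context length (the GPT-2 default).
    The number of attention heads does not change the count.
    """
    # Token embeddings (shared with the output layer) + position embeddings.
    embeddings = vocab_size * d_model + n_positions * d_model
    # Attention: fused QKV projection + output projection, each with bias.
    attention = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)
    # Feed-forward: d -> 4d -> d, each with bias.
    mlp = (d_model * 4 * d_model + 4 * d_model) + (4 * d_model * d_model + d_model)
    # Two LayerNorms per block, each with a scale and a shift vector.
    layer_norms = 2 * (2 * d_model)
    per_layer = attention + mlp + layer_norms
    final_layer_norm = 2 * d_model
    return embeddings + n_layers * per_layer + final_layer_norm


multilingual = gpt2_param_count(n_layers=12, d_model=768, vocab_size=32_768)
monolingual = gpt2_param_count(n_layers=4, d_model=512, vocab_size=8_192)
print(f"multilingual: {multilingual:,}")  # 111,008,256
print(f"monolingual:  {monolingual:,}")   # 17,329,152
```

The monolingual figure of about 17M is derived here from the listed dimensions under the same assumptions; it is not a number reported in this summary.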

Evaluation Framework

Formal Linguistic Ability

  • MonoBLiMP: Language-specific minimal-pair benchmarks
  • MultiBLiMP: Large-scale minimal-pair dataset built from Universal Dependencies
  • CLAMS: Cross-lingual subject-verb agreement benchmark

Functional Linguistic Ability

  • Knowledge-Intensive Tasks: Global-MMLU, INCLUDE, BM-LAMA
  • Reasoning Tasks: XNLI, HellaSwag, Belebele, ARC, XCOPA, etc.

Evaluation Methods

  • Zero-Shot Evaluation: Minimal-pair comparison based on the probabilities the model assigns to each sentence
  • Fine-Tuning Evaluation: Classification and question-answering tasks, up to 8,000 training samples, 10 epochs
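
The zero-shot protocol can be illustrated with a toy example: a model is counted as correct on a minimal contrast (minimal-pair) item when it assigns a higher probability to the grammatical sentence than to its ungrammatical counterpart. The sketch below is an assumption-heavy stand-in. It scores sentences with a tiny add-one-smoothed bigram model rather than a neural language model, and the corpus, the sentence pairs, and the names `train_bigram` and `minimal_pair_accuracy` are invented for illustration.

```python
import math
from collections import Counter


def train_bigram(corpus):
    """Return a sentence log-probability function from a toy bigram model."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    vocab_size = len(unigrams)

    def logprob(sentence):
        tokens = ["<s>"] + sentence.split()
        # Add-one smoothing keeps unseen bigrams at a small non-zero probability.
        return sum(
            math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
            for prev, cur in zip(tokens, tokens[1:])
        )

    return logprob


def minimal_pair_accuracy(pairs, logprob):
    """Fraction of (grammatical, ungrammatical) pairs ranked correctly."""
    correct = sum(logprob(good) > logprob(bad) for good, bad in pairs)
    return correct / len(pairs)


# Invented toy data: subject-verb agreement in English.
corpus = ["the dog runs", "the dogs run", "the cat runs", "the cats run"]
pairs = [
    ("the dog runs", "the dog run"),
    ("the cats run", "the cats runs"),
]
score = train_bigram(corpus)
print(minimal_pair_accuracy(pairs, score))  # 1.0
```

With a neural model, `logprob` would instead sum the token log-probabilities the model assigns to the sentence; the comparison and the accuracy computation stay the same. Chance performance on this kind of task is 50%.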

Comparison Methods

  • Chance Baseline: Random-guess performance on each task
  • Comparative Model: Qwen3-0.6B (a small multilingual model chosen as a reference of appropriate scale)
  • Architecture Comparison: GPT-BERT vs GPT-2

Experimental Results

Main Results

Monolingual Model Performance

  • MultiBLiMP Tasks: Tier 1 languages typically exceed 80% accuracy, demonstrating strong grammatical learning ability
  • Other Benchmarks: On most other tasks performance is close to chance, reflecting the limited data scale
  • Data Scale Impact: Performance orders as Tier 1 > Tier 2 > Tier 3, showing that data quantity matters

Multilingual vs Monolingual Comparison

  • MultiBLiMP: Monolingual models typically outperform multilingual models, except for 4 Tier 3 languages
  • Belebele: Both model types approach random performance, while Qwen performs significantly better
  • Overall Trend: Qwen surpasses the proposed models on most tasks, but the multilingual model outperforms it in 8 languages

Bilingual Model Effects

  • Knowledge-Intensive Tasks: SIB-200, BM-LAMA, XCOMPS, INCLUDE show consistent performance improvements
  • Grammatical Tasks: MultiBLiMP performance remains essentially unchanged, indicating syntactic ability is less sensitive to bilingual input
  • Special Cases: Dutch shows slight decline on INCLUDE task, possibly due to domain mismatch

Ablation Studies

Architecture Comparison (GPT-2 vs GPT-BERT)

  • GPT-2 models consistently outperform GPT-BERT on SIB-200 and MultiBLiMP tasks
  • Results indicate GPT-2 architecture is better suited for small-scale data training in the current configuration

Language Coverage Analysis

  • Tier 1 Languages: Chinese, French, Bulgarian, etc., with relatively abundant developmentally plausible data
  • Tier 2 Languages: Japanese, Serbian, Cantonese, etc., with moderate data quantities
  • Tier 3 Languages: Mostly low-resource languages, primarily relying on multilingual resource padding

Related Work

BabyLM Challenge

  • First Edition: 10M and 100M word English corpus, 39% developmentally plausible data
  • Second Edition: Increased to 70% child-directed data
  • Evaluation Methods: Zero-shot minimal-pair evaluation and fine-tuning evaluation

Multilingual Extension Efforts

  • Salhan et al. (2024): Curriculum learning for French, German, Japanese, and Chinese acquisition
  • Prévot et al. (2024): Spontaneous speech corpus research for English and French
  • Matzopoulos et al. (2025): BabyLM research for isiXhosa, highlighting low-resource language challenges

Existing Multilingual Resources

  • CHILDES: Child-adult interaction database for 40+ languages
  • MAO-CHILDES: Age-ordered dataset for 5 languages
  • IPA-CHILDES: Phonemicized corpus for 31 languages

Conclusions and Discussion

Main Conclusions

  1. Feasibility Validation: Successfully constructed developmentally plausible datasets for 45 languages, demonstrating the feasibility of multilingual BabyLM research
  2. Data Quantity Impact: More developmentally plausible data improves grammatical learning, most visibly on MultiBLiMP
  3. Bilingual Benefits: Consistent performance improvements on knowledge-intensive tasks with bilingual training
  4. Architecture Selection: GPT-2 architecture outperforms GPT-BERT under small-scale data settings

Limitations

  1. Uneven Language Coverage: Despite covering 45 languages, African languages and minority languages remain underrepresented
  2. Data Composition Variance: Significant differences in developmental plausibility ratios across languages may affect cross-linguistic comparisons
  3. Evaluation Resource Constraints: Lack of standardized evaluation benchmarks covering all languages
  4. Data Approximation: Datasets represent only rough approximations of actual child language input

Future Directions

  1. Expand Language Coverage: Particularly African languages and other low-resource languages
  2. Improve Data Quality: Collect more high-quality child-directed speech data
  3. Standardize Evaluation: Develop cross-linguistically consistent evaluation frameworks
  4. Multilingual Ability Research: Investigate bilingual and multilingual acquisition mechanisms in depth

In-Depth Evaluation

Strengths

  1. Systematic Contribution: First systematic construction of large-scale multilingual developmentally plausible datasets
  2. Community-Oriented: Established sustainable community-driven data collection framework
  3. Methodological Rigor: Employed byte-equivalent calibration ensuring cross-linguistic data quantity comparability
  4. Strong Openness: Complete release of data, code, and models promoting reproducible research
  5. High Practical Value: Provides important resources for multilingual cognitive modeling and data efficiency research

Limitations

  1. Inconsistent Data Quality: Significant variation in developmental plausibility ratios across languages
  2. Limited Model Performance: Baseline models approach random performance on most tasks
  3. Incomplete Evaluation Coverage: Some languages lack sufficient evaluation benchmarks
  4. Insufficient Theoretical Analysis: Lacks in-depth analysis of why certain languages or tasks perform better

Impact

  1. Field Contribution: Fills gap in multilingual developmentally plausible datasets, advancing related research
  2. Practical Value: Provides important starting point for low-resource language model research
  3. Reproducibility: Complete open-source resources ensure research reproducibility and scalability
  4. Community Building: Establishes sustainable collaborative framework promoting long-term development

Applicable Scenarios

  1. Cognitive Linguistics Research: Exploring relationships between human language acquisition and machine learning
  2. Low-Resource Language Modeling: Providing training starting points for resource-scarce languages
  3. Multilingual Education: Supporting bilingual and multilingual learning research
  4. Data Efficiency Research: Investigating model training strategies under limited data budgets

Technical Innovation Points

Data Collection Innovation

  1. Byte-Equivalent Calibration: Adjusting data quantities across languages using UTF-8 encoding size, ensuring fair comparison
  2. Hierarchical Data Organization: Stratifying languages into three tiers based on available data, balancing coverage and quality
  3. Community-Driven Quality Control: Each language managed by native or fluent speakers, ensuring cultural and linguistic appropriateness
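
One way to read the byte-equivalent calibration in item 1: a budget stated in English words is converted into a byte budget for another language by scaling with how many UTF-8 bytes that language needs for comparable content. The sketch below is an illustration under stated assumptions, not the paper's procedure. The names `utf8_size`, `byte_ratio`, and `target_byte_budget`, the default of 6 bytes per English word, and the example ratio of 1.5 are all hypothetical and do not come from the paper.

```python
def utf8_size(text: str) -> int:
    """Number of bytes needed to store the text in UTF-8."""
    return len(text.encode("utf-8"))


def byte_ratio(target_text: str, english_text: str) -> float:
    """Bytes used by the target-language text relative to parallel English text."""
    return utf8_size(target_text) / utf8_size(english_text)


def target_byte_budget(english_words: int, ratio: float,
                       english_bytes_per_word: float = 6.0) -> int:
    """Convert an English word budget into a byte budget for another language.

    english_bytes_per_word=6.0 is a placeholder, not a measured value.
    """
    return round(english_words * english_bytes_per_word * ratio)


# Cyrillic letters take 2 bytes each in UTF-8, ASCII letters take 1.
print(utf8_size("house"), utf8_size("дом"))  # 5 6
print(byte_ratio("дом", "house"))            # 1.2
print(target_byte_budget(1_000_000, ratio=1.5))  # 9000000
```

In practice the ratio would be estimated from a large parallel corpus rather than a single word pair. The point of the calibration is that a script needing more bytes per character is given a proportionally larger byte budget, so that each language receives a comparable amount of content.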

Evaluation Framework Innovation

  1. Dual-Mode Evaluation: Combining zero-shot and fine-tuning evaluation for comprehensive ability assessment
  2. Cross-Linguistic Consistency: Using tools like MultiBLiMP to ensure cross-linguistic evaluation comparability
  3. Capability-Stratified Evaluation: Distinguishing between formal and functional linguistic ability assessment

Open Science Practices

  1. Complete Resource Release: Data, code, and models all open-sourced
  2. Scalable Design: Standardized pipeline supporting community contributions
  3. Transparent Documentation: Detailed information on data sources, licenses, and preprocessing

This work makes important contributions at the intersection of multilingual language model research and cognitive linguistics, establishing a sustainable research platform that could advance understanding of human language acquisition mechanisms.