
Studies with impossible languages falsify LMs as models of human language

Bowers, Mitchell
According to Futrell and Mahowald [arXiv:2501.17047], both infants and language models (LMs) find attested languages easier to learn than impossible languages that have unnatural structures. We review the literature and show that LMs often learn attested and many impossible languages equally well. Difficult to learn impossible languages are simply more complex (or random). LMs are missing human inductive biases that support language acquisition.

Basic Information

  • Paper ID: 2511.11389
  • Title: Studies with impossible languages falsify LMs as models of human language
  • Authors: Jeffrey S. Bowers (University of Bristol), Jeff Mitchell (University of Sussex)
  • Classification: cs.CL (Computational Linguistics)
  • Paper Type: Commentary on Futrell & Mahowald (in press), Behavioral and Brain Sciences
  • Paper Link: https://arxiv.org/abs/2511.11389

Abstract

This paper is a commentary on Futrell and Mahowald's (F&M) work on language models and human language learning. F&M claim that both infants and language models (LMs) find natural languages easier to learn than "impossible languages" with non-natural structures. Through a literature review, the authors demonstrate that LMs frequently learn both natural languages and many impossible languages with equal ease. Those impossible languages that are difficult to learn are merely more complex or random. The authors argue that LMs lack the inductive biases that support human language acquisition.

Research Background and Motivation

Core Problem

This paper focuses on a fundamental theoretical question: Are language models (LMs) appropriate models of human language acquisition?

Importance of the Problem

  1. The speed puzzle of language acquisition: Infants learn language at a remarkable pace, which is a core challenge for language acquisition models
  2. Focus of theoretical debate: Chomsky's Universal Grammar (UG) theory posits that humans possess innate linguistic inductive biases that not only constrain the structure of all natural languages but also enable children to learn rapidly
  3. Challenge from LMs: Large language models such as ChatGPT lack human-like prior knowledge yet demonstrate excellent performance on various language tasks, challenging traditional linguistic theories

Limitations of Existing Approaches

  1. F&M's perspective: Claims that LMs, like humans, find natural languages easier to learn than impossible languages, suggesting that LMs possess inductive biases aligned with human language
  2. Chomsky's critique: Argues that LMs' ability to learn both human-possible and impossible languages with equal ease represents their deepest flaw as models of human language
  3. Divergence in literature interpretation: Different interpretations of the same studies lead to opposite conclusions

Research Motivation

The authors aim to systematically review the literature to clarify empirical evidence regarding LMs' ability to learn impossible languages, challenge F&M's position, and support Chomsky's claim that LMs lack human language inductive biases.

Core Contributions

  1. Systematic literature review: Comprehensive review and reanalysis of recent studies on LMs learning impossible languages
  2. Clarification of empirical evidence: Reveals F&M's misreading of existing research, showing that LMs can actually learn many impossible languages easily
  3. Theoretical analysis: Distinguishes between "difficult to learn" and "structurally complex/random," arguing that difficult-to-learn impossible languages are merely more complex or random
  4. Support for Chomsky's thesis: Provides evidence that LMs lack the language-learning inductive biases unique to humans
  5. Methodological critique: Points out the applicability of the "no free lunch theorem," arguing that poor LM performance on certain languages is unsurprising

Methodology Details

Task Definition

Rather than proposing new methods, this paper conducts a critical literature review. The core task is:

  • Input: Recent empirical studies on LMs learning impossible languages
  • Output: Systematic reinterpretation and theoretical analysis of these studies
  • Goal: Assess whether LMs truly struggle with impossible languages as humans do

Analytical Framework

1. Definition of Impossible Languages

  • Attested Languages: Natural languages actually used by humans
  • Impossible Languages: Artificially constructed languages violating Universal Grammar constraints, such as languages with completely reversed word order
  • Complex/Random Languages: Languages lacking structure or containing multiple arbitrary rules
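The contrast between a structured impossible language and a random one can be made concrete with a small sketch. The function names `full_reverse` and `random_shuffle` and the example sentence are illustrative assumptions, not the transformations used in any of the reviewed studies.

```python
import random

def full_reverse(tokens):
    """Structured transformation: one fixed rule (reverse the sentence)."""
    return tokens[::-1]

def random_shuffle(tokens, rng):
    """Unstructured transformation: a fresh random order for every sentence."""
    shuffled = list(tokens)
    rng.shuffle(shuffled)
    return shuffled

sentence = "the cat sat on the mat".split()

# Reversal is undone by applying the same single rule again,
# so all of the original word-order structure is preserved.
assert full_reverse(full_reverse(sentence)) == sentence

# Shuffling keeps the same words, but no fixed rule recovers the order.
rng = random.Random(0)  # seeded only to make the sketch repeatable
shuffled = random_shuffle(sentence, rng)
assert sorted(shuffled) == sorted(sentence)
```

Under this reading, a reversed language is as regular as the original, whereas a randomly shuffled language has no word-order regularity left to learn.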

2. Evaluation Criteria

The authors employ the following criteria to assess LM learning performance:

  • Learning speed: Amount of training data required for LMs to learn different language types
  • Final performance: Ultimate performance of LMs on different languages
  • Comparative analysis: Natural languages vs. impossible languages vs. random/complex languages

3. Theoretical Framework

  • Chomsky's prediction: If LMs lack UG, they should learn impossible languages as easily as natural languages
  • F&M's counterargument: Claims that LMs exhibit learning preferences consistent with humans
  • "No free lunch theorem": Averaged over all possible problems, no learning algorithm outperforms any other, so an algorithm that performs well on one class of data must perform worse on others
  • Key distinction: Poor performance on certain languages ≠ possession of human-like inductive biases

Technical Innovation

The paper's innovation lies not in technical methods but in theoretical analysis depth:

  1. Language type differentiation: Clear distinction between "languages violating UG" and "random/complex languages"
  2. Reinterpretation of empirical results: Points out how F&M conflates language complexity with language possibility
  3. Theoretical consistency verification: Uses the "no free lunch theorem" to argue that poor LM performance on certain languages is inevitable and is not evidence of human-like inductive biases

Experimental Setup

Rather than presenting new experiments, this paper reanalyzes previously published research. The authors systematically review the following studies:

Reviewed Studies

1. Kallini et al. (2024)

  • Experimental design: Tests LMs' ability to learn English and various impossible languages
  • F&M's interpretation: LMs learn natural English text consistently faster than baseline impossible languages
  • Authors' reanalysis:
    • Although two impossible languages were reported as difficult to learn, many others were nearly as easy to learn as English, including one designed by Mitchell & Bowers (2020)
    • The most difficult-to-learn impossible language is random word order scrambling (no structure to learn)
    • Another difficult language is deterministic random scrambling (different scrambling rules for different sentence lengths, equivalent to learning multiple random languages)

2. Yang et al. (2025)

  • Experimental design: Evaluates LM performance on various impossible languages, including deterministic scrambling languages
  • Findings: Many impossible languages are easy to learn; random scrambling languages are difficult
  • Authors' critique: Yang et al. incorrectly assume that Chomsky predicts LMs should learn random scrambling languages, but learning multiple different random languages (one per sentence length) is difficult under any theory

3. Xu et al. (2025)

  • Experimental design: Varies language plausibility rather than impossibility
  • Findings: LMs struggle with certain implausible languages but easily learn others
  • Authors' observation: Xu et al. themselves acknowledge potential errors in material construction, which may have added noise to the counterfactual corpora

4. Ziv et al. (2025)

  • Findings: Reports multiple impossible languages that LMs learn easily, including partially reversed languages (replicating Mitchell & Bowers, 2020 results)

5. Lou et al. (2024) (not cited by F&M)

  • Findings: LMs can easily learn completely reversed languages

Data Summary

Study | Easy-to-learn impossible languages | Difficult language types | Key issue
Kallini et al. | Multiple, including the Mitchell & Bowers (2020) language | Random scrambling, deterministic multiple scrambling | Difficult languages are random/complex
Yang et al. | Multiple | Deterministic multiple scrambling | Conflates complexity with impossibility
Xu et al. | Some implausible languages | Some implausible languages | Material construction may be flawed
Ziv et al. | Partially reversed languages, etc. | - | Supports Chomsky's view
Lou et al. | Completely reversed languages | - | Supports Chomsky's view

Experimental Results

Main Findings

1. LMs frequently learn impossible languages easily

  • Impossible languages designed by Mitchell & Bowers (2020) are confirmed to be easy to learn
  • Partially reversed languages (Ziv et al., 2025) are easy to learn
  • Completely reversed languages (Lou et al., 2024) are easy to learn
  • Both Kallini et al. and Yang et al. report multiple easy-to-learn impossible languages

2. Difficult-to-learn "impossible languages" are actually complex/random languages

  • Complete random scrambling: No structure to learn
  • Deterministic multiple scrambling: Requires learning multiple different random mapping rules (one per sentence length)
  • These languages' difficulty stems from complexity and randomness, not UG violation

3. Massive differences in data efficiency

The authors cite Bowers (2025a) noting that:

  • LMs require several orders of magnitude more training data than infants
  • This is consistent with lacking human inductive biases

4. Limited effectiveness of UG induction attempts

McCoy & Griffiths (2025) attempted to distill Bayesian priors into LMs:

  • Failed to significantly improve data efficiency (Bowers, 2025b)

Theoretical Analysis

Application of "No Free Lunch Theorem"

The authors apply the theorem of Wolpert & Macready (2002):

  • Core principle: Averaged over all possible problems, no learning algorithm outperforms any other, so an algorithm that performs well on one data class must perform worse on others
  • Implication: Poor LM performance on certain languages (e.g., random scrambling) is inevitable and needs no empirical confirmation
  • Key distinction: Poor performance on certain languages ≠ possession of human-like inductive biases
  • Falsification logic: Successfully learning certain impossible languages falsifies LMs as appropriate models of human language learning
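The intuition behind the "no free lunch" argument can be checked by brute force on a toy problem. The sketch below is an illustration under stated assumptions, not the formal theorem: it enumerates every possible binary labeling of 8 inputs, trains on 4 of them, tests on the other 4, and averages uniformly over all labelings. The two learners (`majority_learner`, `always_zero_learner`) are hypothetical examples.

```python
from fractions import Fraction
from itertools import product

N_TRAIN, N_TEST = 4, 4

def majority_learner(train_labels):
    """Predict the majority training label for every unseen input."""
    guess = 1 if sum(train_labels) * 2 > len(train_labels) else 0
    return [guess] * N_TEST

def always_zero_learner(train_labels):
    """Ignore the training data and always predict 0."""
    return [0] * N_TEST

def mean_test_accuracy(learner):
    """Average off-training-set accuracy over ALL possible target labelings."""
    targets = list(product([0, 1], repeat=N_TRAIN + N_TEST))
    total = Fraction(0)
    for target in targets:
        train_labels, test_labels = target[:N_TRAIN], target[N_TRAIN:]
        preds = learner(train_labels)
        correct = sum(p == t for p, t in zip(preds, test_labels))
        total += Fraction(correct, N_TEST)
    return total / len(targets)

# Both learners land on exactly chance once every target is weighted equally.
assert mean_test_accuracy(majority_learner) == Fraction(1, 2)
assert mean_test_accuracy(always_zero_learner) == Fraction(1, 2)
```

In this toy setting, any advantage a learner has on some targets is cancelled by a disadvantage on others, which is why finding some languages that an LM learns poorly says little by itself about whether its biases resemble human ones.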

Case Studies

Case 1: Kallini et al.'s Deterministic Scrambling Language

Original sentence (length 5): The cat sat on mat
Scrambling rule 1 (length 5): cat The on sat mat
Original sentence (length 6): The big cat sat on mat
Scrambling rule 2 (length 6): big The sat cat mat on

Analysis: Learning this language is equivalent to learning a separate arbitrary mapping for each sentence length, so the number of mappings to learn grows with the number of distinct sentence lengths in the corpus. This tests memory for arbitrary mappings, not UG biases.
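A minimal sketch of a length-keyed deterministic shuffle shows why the number of rules scales with the number of sentence lengths. The seeding scheme, the function names `permutation_for_length` and `deterministic_shuffle`, and the toy corpus are assumptions made for illustration; this is not the implementation used by Kallini et al.

```python
import random

def permutation_for_length(n, seed=0):
    """Return one fixed pseudo-random permutation of range(n) per length n."""
    rng = random.Random(seed * 1000 + n)  # hypothetical seeding scheme
    perm = list(range(n))
    rng.shuffle(perm)
    return perm

def deterministic_shuffle(tokens, seed=0):
    """Reorder tokens with the permutation assigned to this sentence length."""
    perm = permutation_for_length(len(tokens), seed)
    return [tokens[i] for i in perm]

corpus = [
    "the cat sat on mat".split(),      # length 5
    "a dog ran to me".split(),         # length 5, same permutation as above
    "the big cat sat on mat".split(),  # length 6, a different permutation
]
scrambled = [deterministic_shuffle(s) for s in corpus]

# A learner has to recover one arbitrary permutation per distinct length.
rules_needed = len({len(s) for s in corpus})
assert rules_needed == 2
```

Sentences of equal length are reordered identically, but nothing learned about one length transfers to another, so each new length adds a further arbitrary mapping.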

Case 2: Mitchell & Bowers (2020)'s Partially Reversed Language

Certain word order rules are systematically reversed while maintaining consistency. Finding: LMs learn this easily, indicating they lack inductive biases excluding such languages.

Language Acquisition Theory

1. Chomsky's Universal Grammar (UG)

  • Humans possess innate, language-specific inductive biases
  • UG constraints restrict possible human language structures
  • Enable children to learn languages rapidly

2. Statistical Learning Theory

  • Emphasizes extracting statistical regularities from input data
  • LMs are the most prominent current example of large-scale statistical learning

Comparative Studies of LMs and Human Language Learning

Research Supporting LMs

  • F&M and cited studies claim LMs exhibit human-like learning preferences

Research Criticizing LMs

  • Mitchell & Bowers (2020): First systematic demonstration that LMs learn impossible languages
  • Chomsky et al. (2023): Criticizes LMs' inability to distinguish possible from impossible languages
  • Bowers (2025a): Emphasizes that LMs' data efficiency is far below that of humans

This Paper's Position

This paper adopts the Chomsky tradition in theoretical linguistics, refuting recent arguments from the connectionist/statistical learning camp through reanalysis of empirical research.

Conclusions and Discussion

Main Conclusions

  1. Empirical evidence does not support F&M's view: LMs frequently learn natural languages and impossible languages with equal ease
  2. Difficult-to-learn "impossible languages" are complex/random: Learning difficulty stems from complexity rather than UG violation
  3. LMs lack human inductive biases: Combined evidence of easy-to-learn impossible languages and low data efficiency shows LMs' learning patterns fundamentally differ from humans
  4. "No free lunch" cannot serve as supporting evidence: Poor LM performance on certain languages is inevitable, not proof of human-like biases
  5. LMs are not appropriate models of human language acquisition: Current LMs' learning approach is precisely what we expect from systems lacking human innate linguistic biases

Limitations

Paper's Own Limitations

  1. No new empirical data: Based solely on literature review without new experiments
  2. Vague definition of impossible languages: Different studies operationalize "impossible languages" inconsistently
  3. Insufficient mechanistic exploration: Lacks detailed analysis of why LMs can learn impossible languages
  4. Limited sample size: Relatively few reviewed studies (primarily five recent papers)

Field Limitations

  1. Ecological validity of impossible languages: Artificially constructed impossible languages may not fully capture UG constraints
  2. LM diversity: Different architectures may perform differently, but the paper insufficiently distinguishes them
  3. Measurement issues: How to accurately measure "learning difficulty" remains contested

Future Directions

Explicitly Proposed by the Paper

  1. Stricter impossible language design: More precise operationalization of UG violations needed
  2. Mechanistic research: Understanding internal representations and processes in LM learning of impossible languages

Implicitly Suggested Directions

  1. Cross-model comparison: Systematic comparison of inductive biases across different LM architectures
  2. Developmental trajectory research: Comparing learning curves between LMs and children
  3. Hybrid models: Exploring integration of linguistic priors into LMs
  4. Neuroscience verification: Brain imaging studies verifying human neural mechanisms for processing impossible languages

In-Depth Evaluation

Strengths

1. High theoretical clarity

  • Clear distinction between "complexity" and "impossibility," a crucial conceptual clarification
  • Correct application of "no free lunch theorem," revealing logical fallacies

2. Deep literature analysis

  • Goes beyond reading conclusions to deeply analyzing experimental designs and data
  • Identifies F&M's selective citation and misreading problems

3. Rigorous logical argumentation

  • Uses falsification logic: successfully learning impossible languages falsifies LMs as human models
  • Points out asymmetry in opponent's argument: difficulty on certain languages cannot prove human-like biases

4. Academic integrity

  • Acknowledges material problems identified by Xu et al. researchers themselves
  • Fairly presents all viewpoints

5. Significant theoretical implications

  • Touches on core linguistic debates: nature vs. nurture, UG vs. statistical learning
  • Offers insights for AI field: LMs' capability boundaries

Weaknesses

1. Weak empirical foundation

  • No new data: Entirely dependent on reinterpretation of others' research
  • Possible selectivity: While the paper criticizes F&M's selective citation, its own literature selection may also be biased
  • Lack of quantitative synthesis: No meta-analysis or systematic quantitative review

2. Insufficient concept operationalization

  • Vague "impossible language" definition: Different studies use different definitions, insufficiently discussed
  • Unclear "easy" vs. "difficult" standards: No explicit quantitative criteria provided
  • Unmeasured "complexity": How to quantify language complexity?

3. Argumentative limitations

  • Deterministic scrambling argument: While noting complexity, whether this complexity is entirely unrelated to UG violations remains debatable
  • "No free lunch" applicability: This theorem applies to optimization problems; direct application to language learning needs more justification
  • Unexplored alternative explanations: LMs may possess other types of inductive biases (e.g., locality preferences), just different from UG

4. Insufficient mechanistic exploration

  • Black-box analysis: Judges only from input-output, without analyzing internal LM representations
  • Lack of constructive solutions: More criticism than construction; no proposals for improving LMs

5. Somewhat polemical tone

  • Clear stance: Clearly sides with Chomsky, potentially affecting objectivity
  • Harsh criticism of opponents: Terms like "misreading" and "error" could be more diplomatically phrased in academic debate

6. Sample size and representativeness

  • Only five main papers reviewed: Relatively small sample
  • Narrow time window: Primarily 2020-2025 research
  • Homogeneous model types: Mainly focuses on Transformer-based LMs

Impact Assessment

Contribution to the Field

  1. Theoretical clarification: Important conceptual distinction (complexity vs. impossibility)
  2. Methodological contribution: Identifies common pitfalls in experimental design
  3. Advancing debate: Will promote more rigorous experimental design and deeper theoretical discussion

Potential Impact

  • Short-term: Likely to provoke responses from F&M and related researchers, advancing academic debate
  • Medium-term: Encourages researchers to design stricter impossible language experiments
  • Long-term: May influence assessment of LMs' role in cognitive science

Practical Value

  • For AI research: Understanding LM inductive biases has value for model improvement
  • For education: If LMs learn differently from humans, they cannot be used directly to simulate language teaching

Reproducibility

  • High: Paper is primarily a literature review; all cited research is published, allowing readers to verify the analysis

Applicable Scenarios

Suitable Audiences

  1. Theoretical linguists: Interested in UG and language acquisition theory
  2. Computational linguists: Studying LM capabilities and limitations
  3. Cognitive scientists: Concerned with computational models of human language processing
  4. AI researchers: Thinking about improving LM inductive biases

Applicable Research Scenarios

  1. Designing impossible language experiments: Provides important methodological guidance
  2. Evaluating LM cognitive plausibility: Offers theoretical framework
  3. Linguistic theory debates: Supports nativist positions

Inapplicable Scenarios

  1. Engineering applications: Limited help for practical LM applications
  2. Non-linguistic domains: Arguments specific to language learning

Key References

Core Debate Literature

  1. Chomsky et al. (2023): "The False Promise of ChatGPT" - Chomsky's classic critique of LMs
  2. Futrell & Mahowald (2025): Target paper being commented on, representing pro-LM perspective

Key Empirical Studies

  1. Mitchell & Bowers (2020): First systematic demonstration of LMs learning impossible languages
  2. Kallini et al. (2024): "Mission: Impossible language models" - One of the most comprehensive empirical studies
  3. Yang et al. (2025): Cross-linguistic impossible language learning research

Theoretical Foundations

  1. Wolpert & Macready (2002): "No free lunch theorems" - Foundational machine learning theory
  2. McCoy & Griffiths (2025): Research integrating Bayesian priors into LMs
  3. Bowers (2025a): Systematic analysis of LM data efficiency
  4. Bowers (2025b): Commentary on McCoy & Griffiths

Overall Evaluation

This is a theoretically well-positioned, logically rigorous, but empirically relatively weak commentary paper. Through deep analysis of existing literature, the authors powerfully challenge the view that "LMs possess human-like language inductive biases," supporting Chomsky's traditional linguistic position.

Its greatest value lies in its conceptual clarification (distinguishing complexity from impossibility) and logical analysis (applying falsification logic and the "no free lunch theorem"), which contribute importantly to the field's methodology.

The main limitations are the lack of new empirical data and insufficient analysis of LM internal mechanisms. For a commentary paper this is understandable, but it limits persuasiveness.

This paper will promote deep discussion in linguistics and AI about LM nature, encouraging stricter experimental design, but may not immediately shift both camps' fundamental positions. Resolving this debate likely requires more empirical research, more precise theoretical frameworks, and possibly independent evidence from neuroscience.

Recommendation: ⭐⭐⭐⭐ (4/5)

  • Theoretical contribution: ⭐⭐⭐⭐⭐
  • Empirical sufficiency: ⭐⭐⭐
  • Methodological innovation: ⭐⭐⭐
  • Practical value: ⭐⭐⭐
  • Writing quality: ⭐⭐⭐⭐