Comparing Human and Language Models Sentence Processing Difficulties on Complex Structures

Amouyal, Meltzer-Asscher, Berant

Large language models (LLMs) that fluently converse with humans are a reality - but do LLMs experience human-like processing difficulties? We systematically compare human and LLM sentence comprehension across seven challenging linguistic structures. We collect sentence comprehension data from humans and five families of state-of-the-art LLMs, varying in size and training procedure in a unified experimental framework. Our results show LLMs overall struggle on the target structures, but especially on garden path (GP) sentences. Indeed, while the strongest models achieve near perfect accuracy on non-GP structures (93.7% for GPT-5), they struggle on GP structures (46.8% for GPT-5). Additionally, when ranking structures based on average performance, rank correlation between humans and models increases with parameter count. For each target structure, we also collect data for their matched baseline without the difficult structure. Comparing performance on the target vs. baseline sentences, the performance gap observed in humans holds for LLMs, with two exceptions: for models that are too weak performance is uniformly low across both sentence types, and for models that are too strong the performance is uniformly high. Together, these reveal convergence and divergence in human and LLM sentence comprehension, offering new insights into the similarity of humans and LLMs.

Basic Information

Paper ID: 2510.07141
Title: Comparing Human and Language Models Sentence Processing Difficulties on Complex Structures
Authors: Samuel Joseph Amouyal, Aya Meltzer-Asscher, Jonathan Berant
Classification: cs.CL cs.AI
Publication Date: October 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.07141

Abstract

Large language models (LLMs) have demonstrated fluent conversation capabilities with humans, but do they encounter sentence processing difficulties similar to those experienced by humans? This study systematically compares sentence comprehension abilities between humans and LLMs across seven challenging linguistic structures. The research collected sentence comprehension data from human subjects and five state-of-the-art LLM families, which vary in scale and training procedures. Results demonstrate that LLMs exhibit widespread processing difficulties on target structures, particularly on garden path (GP) sentences. While the strongest models achieve near-perfect accuracy on non-GP structures (GPT-5 reaches 93.7%), they struggle with GP structures (GPT-5 only 46.8%). Furthermore, when ranking structures by average performance, the rank correlation between humans and models increases with model parameter size.

Research Background and Motivation

Problem Definition

With the breakthrough in conversational capabilities of large language models, a critical question emerges: Do LLMs encounter processing difficulties on specific linguistic structures similar to humans? This question is crucial for understanding the cognitive mechanisms of LLMs and their similarities to human language processing.

Research Significance

Cognitive Science Implications: Comparing error patterns between humans and LLMs can shed light on the language processing mechanisms of both
Model Evaluation Needs: Traditional evaluations focus on overall performance while lacking fine-grained analysis of specific linguistic phenomena
Application Value: Understanding LLM language processing limitations aids in improving model design and deployment

Limitations of Existing Research

Indirect Measurement: Most studies employ indirect metrics (e.g., reading time, perplexity) rather than direct comprehension tests
Inconsistent Experimental Settings: Different studies use different models, data, and prompts, making unified conclusions difficult
Limited Coverage: Lack of systematic comparison across multiple linguistic phenomena

Core Contributions

Constructed a sentence comprehension dataset with seven challenging linguistic structures: four types of garden path sentences, double center embedding, similarity interference, and depth charge sentences
Systematically tested 31 state-of-the-art models spanning 5 model families with varying scales and training approaches
Found a difference between GP and non-GP structures: LLM accuracy is closer to human accuracy on GP sentences, whereas on non-GP structures LLMs substantially outperform humans
Proposed the "sweet spot" principle: Only in moderately strong models can processing patterns similar to humans be observed in the target-baseline performance difference

Methodology

Task Definition

Input: A sentence and a comprehension question
Output: Yes/No answer
Objective: Compare performance patterns between humans and LLMs on identical tasks
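
To make the task concrete, the following is a minimal Python sketch of how a yes/no comprehension item could be represented and scored. The `Item` class, its field names, the `normalize` and `accuracy` helpers, the example question, and its expected answer are all illustrative assumptions and are not taken from the authors' code or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """One sentence-question pair with its expected yes/no answer (hypothetical schema)."""
    sentence: str
    question: str
    expected: str  # "yes" or "no"


def normalize(answer: str) -> str:
    """Lower-case a raw response and drop surrounding spaces and trailing punctuation."""
    return answer.strip().lower().rstrip(".!")


def accuracy(items, answers):
    """Fraction of responses that match the expected yes/no label."""
    if not items or len(items) != len(answers):
        raise ValueError("items and answers must be non-empty and of equal length")
    correct = sum(normalize(a) == item.expected for item, a in zip(items, answers))
    return correct / len(items)


# Hypothetical item built on the Subject/Object GP example sentence.
# The question and its expected answer are assumptions for illustration.
items = [
    Item(
        sentence="While the man hunted the deer ran into the woods.",
        question="Did the man hunt the deer?",
        expected="no",
    ),
]
print(accuracy(items, ["No."]))
```

The same scoring function can be applied to human responses and to model outputs, which is what makes the two directly comparable.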

Experimental Structure Design

Seven Linguistic Structures

Garden Path Sentences (4 types):
- Subject/Object GP: "While the man hunted the deer ran into the woods."
- NP/S GP: "The policeman saw the lights were off."
- NP/VP GP: "The complex houses married soldiers."
- Reduced relative GP: "The chef hired last month worked overtime."
Double Center Embedding: Contains two nested clauses, e.g., "The man that the teacher that the student liked called sat."
Depth Charge Sentences: Multiple negation structures, e.g., "No head injury is too trivial to be ignored."
Similarity Interference: Two noun phrases sharing features causing interference, e.g., "The banker that the barber praised climbed the mountain."

Control Design

Each structure includes both target conditions (containing difficult structures) and baseline conditions (with difficult factors removed), ensuring measurement of structure-specific effects.

Experimental Procedure

Human Experiments

Participants: Native English speakers recruited through the Prolific platform
Procedure: Word-by-word presentation (400 ms per word), with each question displayed for 5 seconds
Design: Each participant sees only one sentence-question pair, avoiding learning effects
Sample Size: 5,380 data points, with 10 participants per sentence-question pair

LLM Experiments

Prompting Strategy: Few-shot prompting with examples lacking target structures
Control Variables: 2 system prompts × 4 example orderings = 8 repetitions
Model Coverage: 31 models including GPT, Llama, Qwen, Gemma, and DeepSeek families
Chain-of-Thought Testing: A subset of models was tested both with and without "thinking" mode enabled
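
The 2 × 4 = 8 repetitions can be sketched as a Cartesian product of system prompts and few-shot orderings. Everything below is hypothetical: the prompt wording, the four few-shot examples, the use of cyclic rotations to obtain four orderings, and the `build_prompts` helper are assumptions for illustration, since this summary does not give the actual prompts or orderings.

```python
from itertools import product

# Placeholder system prompts (assumed wording, not from the paper).
SYSTEM_PROMPTS = [
    "Answer the question about the sentence with yes or no.",
    "Read the sentence, then reply yes or no to the question.",
]

# Placeholder few-shot examples that avoid the target structures.
FEW_SHOT = [
    ("The dog chased the cat.", "Did the dog chase the cat?", "yes"),
    ("The girl read a book.", "Did the girl read a letter?", "no"),
    ("The boy ate an apple.", "Did the boy eat an apple?", "yes"),
    ("The nurse called the doctor.", "Did the nurse call the lawyer?", "no"),
]


def rotations(seq):
    """All cyclic rotations of a list; one assumed way to get distinct orderings."""
    return [seq[i:] + seq[:i] for i in range(len(seq))]


def build_prompts(sentence, question, n_orderings=4):
    """Return one prompt per (system prompt, example ordering) combination."""
    orderings = rotations(FEW_SHOT)[:n_orderings]
    prompts = []
    for system, examples in product(SYSTEM_PROMPTS, orderings):
        shots = "\n\n".join(
            f"Sentence: {s}\nQuestion: {q}\nAnswer: {a}" for s, q, a in examples
        )
        user = f"{shots}\n\nSentence: {sentence}\nQuestion: {question}\nAnswer:"
        prompts.append({"system": system, "user": user})
    return prompts


prompts = build_prompts(
    "The policeman saw the lights were off.", "Were the lights off?"
)
print(len(prompts))
```

Averaging accuracy over the eight variants reduces the chance that a result reflects one particular prompt wording or example order.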

Experimental Results

Main Findings

1. Overall Performance Patterns

Human Average Accuracy: 28.3%, confirming that the target structures are difficult
Best LLM Performance: o3 model 74.5% (without chain-of-thought), GPT-5 88.9% (with chain-of-thought)
Structure Differences: LLMs find GP sentences markedly harder than non-GP structures

2. Key Differences Between GP and Non-GP Structures

Participant	GP Accuracy	Non-GP Accuracy	Difference (percentage points)
GPT-5	46.8%	93.7%	46.9
o3	66.5%	87.3%	20.8
Humans	25.8%	32.4%	6.6
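
The difference column is the non-GP accuracy minus the GP accuracy, in percentage points. The short Python check below recomputes it from the accuracies in the table; the dictionary layout and the `gap_pp` helper name are illustrative choices.

```python
# Accuracies (in percent) as reported in the table: (GP, non-GP).
reported = {
    "GPT-5": (46.8, 93.7),
    "o3": (66.5, 87.3),
    "Humans": (25.8, 32.4),
}


def gap_pp(gp, non_gp):
    """Non-GP minus GP accuracy, in percentage points, rounded to one decimal."""
    return round(non_gp - gp, 1)


gaps = {name: gap_pp(gp, non_gp) for name, (gp, non_gp) in reported.items()}
for name, gap in gaps.items():
    print(f"{name}: {gap} pp")
```

The recomputed gaps match the table, and they show that the GP versus non-GP gap is several times larger for the two models than for humans.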

3. Human Similarity Analysis

Absolute Performance Differences:

GP structures: Average difference 0.173 (closer to humans)
Depth charge: Average difference 0.328
Double embedding: Average difference 0.330
Similarity interference: Average difference 0.370
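
One plausible reading of these figures is the absolute gap between model and human accuracy on a structure, averaged over models. The summary does not spell out the exact averaging, so the Python sketch below is an assumption, and the accuracies passed to it are made-up numbers rather than results from the paper.

```python
def avg_abs_diff(model_accs, human_acc):
    """Mean of |model accuracy - human accuracy| over a list of models."""
    if not model_accs:
        raise ValueError("need at least one model accuracy")
    return sum(abs(acc - human_acc) for acc in model_accs) / len(model_accs)


# Made-up accuracies (as fractions) for three hypothetical models on one structure.
example_models = [0.50, 0.30, 0.45]
example_human = 0.40
print(avg_abs_diff(example_models, example_human))
```

Under this reading, a smaller value means model accuracy sits closer to human accuracy on that structure, regardless of which side is higher.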

Rank Correlation: Increases with model scale; o4-mini achieves highest correlation of 0.929 with human structure difficulty rankings.
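
The summary does not name the rank correlation statistic. Assuming it is Spearman's coefficient, the Python sketch below computes it as the Pearson correlation of average ranks. The two rankings in the example are invented for illustration. With seven structures and no ties, Spearman's rho reduces to 1 - Σd²/56, so a value of 0.929 would be consistent with rankings that differ only by two swaps of adjacent structures (Σd² = 4).

```python
def average_ranks(values):
    """Rank values from 1 upward, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation computed on average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


# Invented difficulty ranks for seven structures (1 = hardest).
human_ranks = [1, 2, 3, 4, 5, 6, 7]
model_ranks = [2, 1, 3, 4, 6, 5, 7]  # two adjacent swaps relative to humans
print(round(spearman(human_ranks, model_ranks), 3))
```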

4. "Sweet Spot" Phenomenon

Models require moderate strength to replicate human target-baseline difference patterns:

Too Weak: Poor performance on both conditions
Too Strong: Good performance on both conditions
Moderate: Shows directional differences similar to humans
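
The three regimes can be illustrated with a small Python classifier over a model's target and baseline accuracy. The `floor`, `ceiling`, and `min_gap` thresholds and the label strings are hypothetical values chosen for this sketch; the summary does not report numeric cutoffs.

```python
def classify(target_acc, baseline_acc, floor=0.6, ceiling=0.9, min_gap=0.05):
    """Assign a model to a regime from its target and baseline accuracy.

    All thresholds are illustrative assumptions, not values from the paper.
    """
    if target_acc < floor and baseline_acc < floor:
        return "too weak"  # low on both sentence types
    if target_acc >= ceiling and baseline_acc >= ceiling:
        return "too strong"  # high on both sentence types
    if baseline_acc - target_acc >= min_gap:
        return "human-like gap"  # baseline clearly easier than target
    return "no clear gap"


# Made-up accuracies for three hypothetical models.
for name, target, baseline in [
    ("small", 0.40, 0.45),
    ("medium", 0.60, 0.85),
    ("large", 0.95, 0.97),
]:
    print(name, classify(target, baseline))
```

The point of the sketch is that the human-like pattern is only observable between the floor and the ceiling: outside that band, both conditions saturate and the gap disappears.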

Chain-of-Thought Effects

Strength Dependency: Only sufficiently strong models benefit from chain-of-thought
Structure Specificity: Chain-of-thought provides greater benefits for non-GP structures, with limited effects on GP structures
Exception Cases: GPT-5 shows significant improvement on GP structures with chain-of-thought

Related Work

Neurolinguistic Research

Brain Activation Comparison: Schrimpf et al. compare brain and LLM activation patterns
Cognitive Metric Prediction: Using LLM information to predict human reading times, eye movements, etc.

Syntactic Processing Research

Garden Path Effects: Amouyal et al. found human-like errors in LLMs on specific GP sentences
Center Embedding: Hu et al. showed LLMs, like humans, consider center-embedded sentences ungrammatical

Methodological Contributions

This study is the first to systematically compare multiple linguistic phenomena within a unified framework, overcoming inconsistent experimental settings in previous research.

Conclusions and Discussion

Main Conclusions

Specificity of GP Structures: LLMs show performance closer to humans on GP sentences, possibly because GP sentences require discarding incorrect interpretations rather than merely relying on working memory
Scale Effects: Larger models show higher correlation with humans in structure difficulty rankings
Sweet Spot Principle: Moderately strong models best replicate human processing patterns

Theoretical Explanation

Working Memory Hypothesis: LLMs outperform humans on structures requiring substantial working memory (e.g., double embedding), but perform relatively worse on GP sentences requiring discarding incorrect interpretations, as the latter is not a working memory capacity issue.

Limitations

Model Coverage: The only closed-source model family tested is OpenAI's; Anthropic and Google models are not included
GP Type Limitations: Did not test all types of garden path sentences
Single Metric: Only tested comprehension accuracy; lacks eye-tracking, reading time, and other cognitive metrics

Future Directions

Causal Verification: Design experiments to verify the working memory hypothesis
Extended Testing: Include more model families and GP types
Additional Metrics: Incorporate further cognitive measures, such as reading times and eye-tracking

In-Depth Evaluation

Strengths

Rigorous Experimental Design: Systematic comparison within a unified framework with sufficient variable control
Unprecedented Scale: Covering 31 models and 7 linguistic phenomena, the largest-scale study in this field
Important Findings: The differential discovery between GP and non-GP structures holds significant theoretical value
Methodological Innovation: Measures comprehension directly rather than through indirect metrics such as reading time or perplexity

Weaknesses

Limited Theoretical Explanation: Working memory hypothesis still requires more supporting evidence
Language Limitations: Only tested English; lacks cross-linguistic validation
Single Task: Uses only Yes/No questions, which may not fully reflect comprehension ability

Impact

Academic Contribution: Provides new methodological framework for human-AI cognitive comparison research
Practical Value: Helps understand LLM language processing limitations, guiding model improvements
Reproducibility: Authors commit to open-sourcing code and data, facilitating subsequent research

Applicable Scenarios

Model Evaluation: Provides fine-grained evaluation tools for LLM language comprehension
Cognitive Research: Provides paradigm for comparing language processing mechanisms between artificial and natural intelligence
Educational Applications: Can be used for identifying difficult structures in language learning and targeted training

References

Amouyal et al. (2025). When the LM misunderstood the human chuckled: Analyzing garden path effects in humans and language models.
Christianson et al. (2001). Thematic roles assigned along the garden path linger.
Gibson & Thomas (1999). Memory limitations and structural forgetting.
Gordon et al. (2001). Memory interference during language processing.

Overall Assessment: This is a high-quality interdisciplinary study with methodological innovation, rigorous experimental design, and findings of significant theoretical and practical importance. Particularly, the discovery of differential processing between GP and non-GP structures provides new perspectives for understanding LLM cognitive mechanisms. Despite some limitations, the overall contribution is substantial and merits further in-depth investigation.