Large language models (LLMs) that fluently converse with humans are a reality - but do LLMs experience human-like processing difficulties? We systematically compare human and LLM sentence comprehension across seven challenging linguistic structures. We collect sentence comprehension data from humans and five families of state-of-the-art LLMs, varying in size and training procedure in a unified experimental framework. Our results show LLMs overall struggle on the target structures, but especially on garden path (GP) sentences. Indeed, while the strongest models achieve near perfect accuracy on non-GP structures (93.7% for GPT-5), they struggle on GP structures (46.8% for GPT-5). Additionally, when ranking structures based on average performance, rank correlation between humans and models increases with parameter count. For each target structure, we also collect data for their matched baseline without the difficult structure. Comparing performance on the target vs. baseline sentences, the performance gap observed in humans holds for LLMs, with two exceptions: for models that are too weak performance is uniformly low across both sentence types, and for models that are too strong the performance is uniformly high. Together, these reveal convergence and divergence in human and LLM sentence comprehension, offering new insights into the similarity of humans and LLMs.
Comparing Human and Language Models Sentence Processing Difficulties on Complex Structures
- Paper ID: 2510.07141
- Title: Comparing Human and Language Models Sentence Processing Difficulties on Complex Structures
- Authors: Samuel Joseph Amouyal, Aya Meltzer-Asscher, Jonathan Berant
- Classification: cs.CL cs.AI
- Publication Date: October 2025 (arXiv preprint)
- Paper Link: https://arxiv.org/abs/2510.07141
Large language models (LLMs) converse fluently with humans, but do they encounter sentence processing difficulties similar to those humans experience? This study systematically compares human and LLM sentence comprehension across seven challenging linguistic structures. The authors collect comprehension data from human participants and from five state-of-the-art LLM families that vary in scale and training procedure. LLMs struggle on the target structures overall, and especially on garden path (GP) sentences: the strongest models achieve near-perfect accuracy on non-GP structures (93.7% for GPT-5) yet struggle on GP structures (46.8% for GPT-5). Furthermore, when structures are ranked by average performance, the rank correlation between humans and models increases with parameter count.
With the breakthrough in conversational capabilities of large language models, a critical question emerges: Do LLMs encounter processing difficulties on specific linguistic structures similar to humans? This question is crucial for understanding the cognitive mechanisms of LLMs and their similarities to human language processing.
- Cognitive Science Implications: By comparing error patterns between humans and LLMs, insights into the language processing mechanisms of both can be gained
- Model Evaluation Needs: Traditional evaluations focus on overall performance while lacking fine-grained analysis of specific linguistic phenomena
- Application Value: Understanding LLM language processing limitations aids in improving model design and deployment
- Indirect Measurement: Most studies employ indirect metrics (e.g., reading time, perplexity) rather than direct comprehension tests
- Inconsistent Experimental Settings: Different studies use different models, data, and prompts, making unified conclusions difficult
- Limited Coverage: Lack of systematic comparison across multiple linguistic phenomena
- Constructed a sentence comprehension dataset with seven challenging linguistic structures, including four types of garden path sentences, double center embedding, similarity interference, and deep impact sentences
- Systematically tested 31 state-of-the-art models spanning 5 model families with varying scales and training approaches
- Discovered processing differences between GP and non-GP structures: LLM performance is closer to human performance on GP sentences, while on non-GP structures LLMs far outperform humans
- Proposed the "sweet spot" principle: Only in moderately strong models can processing patterns similar to humans be observed in the target-baseline performance difference
Input: A sentence and a comprehension question
Output: Yes/No answer
Objective: Compare performance patterns between humans and LLMs on identical tasks
- Garden Path Sentences (4 types):
- Subject/Object GP: "While the man hunted the deer ran into the woods."
- NP/S GP: "The policeman saw the lights were off."
- NP/VP GP: "The complex houses married soldiers."
- Reduced relative GP: "The chef hired last month worked overtime."
- Double Center Embedding: Contains two nested clauses, e.g., "The man that the teacher that the student liked called sat."
- Depth-Charge ("Deep Impact") Sentences: Multiple negation structures, e.g., "No head injury is too trivial to be ignored."
- Similarity Interference: Two noun phrases sharing features causing interference, e.g., "The banker that the barber praised climbed the mountain."
Each structure includes both target conditions (containing difficult structures) and baseline conditions (with difficult factors removed), ensuring measurement of structure-specific effects.
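The target/baseline pairing can be pictured as a small record type. The following is a minimal sketch only: the class name `Item`, its fields, the baseline wording, the comprehension question, and the gold answer are all assumptions made for illustration, not the authors' data format or materials.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """One sentence-question pair (hypothetical schema, not the authors' format)."""
    structure: str   # e.g. "subject_object_gp"
    condition: str   # "target" or "baseline"
    sentence: str
    question: str
    answer: str      # gold answer, "yes" or "no"


# The target sentence comes from the list above; the baseline wording,
# the question, and the gold answer are invented for illustration.
target = Item("subject_object_gp", "target",
              "While the man hunted the deer ran into the woods.",
              "Did the deer run into the woods?", "yes")
baseline = Item("subject_object_gp", "baseline",
                "While the man hunted, the deer ran into the woods.",
                "Did the deer run into the woods?", "yes")


def is_correct(item: Item, response: str) -> bool:
    """Score a yes/no response against the gold answer, ignoring case and whitespace."""
    return response.strip().lower() == item.answer
```

Keeping the structure label and the question identical across the two conditions means any accuracy gap can be attributed to the difficult structure itself.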
- Participants: Native English speakers recruited through the Prolific platform
- Procedure: Word-by-word presentation (400ms/word), questions presented for 5 seconds
- Design: Each participant sees only one sentence-question pair, avoiding learning effects
- Sample Size: 5,380 data points, with 10 participants per sentence-question pair
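The timing and sample-size figures above imply some simple arithmetic, sketched below. The sketch assumes whitespace-delimited words, no gap between words, and exactly 10 responses for every pair; none of these details are confirmed by the summary.

```python
WORD_MS = 400         # per-word presentation time reported above
QUESTION_MS = 5_000   # question display time reported above


def trial_duration_ms(sentence: str) -> int:
    """Total display time for one trial, assuming whitespace-delimited words."""
    n_words = len(sentence.split())
    return n_words * WORD_MS + QUESTION_MS


example = "While the man hunted the deer ran into the woods."
print(trial_duration_ms(example))  # 10 words -> 10 * 400 + 5000 = 9000

# If every pair received exactly 10 responses, 5,380 data points
# would correspond to this many sentence-question pairs.
n_pairs = 5_380 // 10
print(n_pairs)  # 538
```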
- Prompting Strategy: Few-shot prompting with examples lacking target structures
- Control Variables: 2 system prompts × 4 example orderings = 8 repetitions
- Model Coverage: 31 models including GPT, Llama, Qwen, Gemma, and DeepSeek families
- Chain-of-Thought Testing: A subset of models was tested with and without "thinking" mode enabled
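The 2 × 4 design above can be sketched as a cross product of system prompts and few-shot orderings. Everything concrete here is a placeholder: the prompt wording, the few-shot examples, the prompt template, and the way the four orderings are chosen (the first four permutations) are assumptions, since the summary does not reproduce the authors' materials.

```python
from itertools import islice, permutations, product

# Hypothetical placeholders: the paper's actual prompts and examples are not shown here.
SYSTEM_PROMPTS = [
    "Answer the question about the sentence with Yes or No.",
    "Read the sentence, then reply Yes or No to the question.",
]
FEW_SHOT = [
    ("The dog chased the cat.", "Did the dog chase the cat?", "Yes"),
    ("The girl read a book.", "Did the girl read a newspaper?", "No"),
    ("The boy ate an apple.", "Did the boy eat an apple?", "Yes"),
]

# Assumption: four distinct orderings of the same examples.
ORDERINGS = list(islice(permutations(FEW_SHOT), 4))


def build_prompts(sentence: str, question: str) -> list[str]:
    """Return one prompt per (system prompt, example ordering) combination."""
    prompts = []
    for system, ordering in product(SYSTEM_PROMPTS, ORDERINGS):
        shots = "\n\n".join(
            f"Sentence: {s}\nQuestion: {q}\nAnswer: {a}" for s, q, a in ordering
        )
        prompts.append(
            f"{system}\n\n{shots}\n\nSentence: {sentence}\nQuestion: {question}\nAnswer:"
        )
    return prompts


prompts = build_prompts("The policeman saw the lights were off.",
                        "Did the policeman see the lights?")
print(len(prompts))  # 2 system prompts x 4 orderings = 8
```

Averaging accuracy over the eight variants reduces the chance that a result reflects one particular prompt wording or example order.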
- Human Average Accuracy: 28.3%, validating structure difficulty
- Best LLM Performance: o3 model 74.5% (without chain-of-thought), GPT-5 88.9% (with chain-of-thought)
- Structure Differences: LLMs find GP sentences markedly harder than non-GP structures
| Model / Group | GP Accuracy | Non-GP Accuracy | Difference (Non-GP minus GP, percentage points) |
|---|---|---|---|
| GPT-5 | 46.8% | 93.7% | 46.9 |
| o3 | 66.5% | 87.3% | 20.8 |
| Humans | 25.8% | 32.4% | 6.6 |
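The difference column can be recomputed directly from the two accuracy columns. The short sketch below uses only the numbers in the table above; the dictionary layout and the function name `gap_pp` are illustrative choices.

```python
# Accuracies (%) copied from the table above.
accuracy = {
    "GPT-5":  {"gp": 46.8, "non_gp": 93.7},
    "o3":     {"gp": 66.5, "non_gp": 87.3},
    "Humans": {"gp": 25.8, "non_gp": 32.4},
}


def gap_pp(row: dict) -> float:
    """Non-GP minus GP accuracy, in percentage points, rounded to one decimal."""
    return round(row["non_gp"] - row["gp"], 1)


gaps = {name: gap_pp(row) for name, row in accuracy.items()}
print(gaps)  # {'GPT-5': 46.9, 'o3': 20.8, 'Humans': 6.6}
```

The GP penalty for GPT-5 is roughly seven times the human gap, which is the contrast the table is meant to highlight.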
Absolute Performance Differences (between model and human accuracy, on a 0-1 scale):
- GP structures: Average difference 0.173 (smallest gap, i.e., closest to humans)
- Deep impact: Average difference 0.328
- Double embedding: Average difference 0.330
- Similarity interference: Average difference 0.370
Rank Correlation: Increases with model scale; o4-mini achieves highest correlation of 0.929 with human structure difficulty rankings.
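To make the rank correlation concrete, here is a minimal sketch of Spearman's rho for tie-free rankings. It assumes the reported statistic is Spearman's coefficient (the text above says only "rank correlation"), and the two rankings are hypothetical, not the paper's data. With seven structures, a value of 0.929 is what the formula yields when the squared rank differences sum to 4, for example when two adjacent pairs are swapped.

```python
def spearman_rho(rank_a: list[int], rank_b: list[int]) -> float:
    """Spearman's rho for two tie-free rankings of the same n items."""
    n = len(rank_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical difficulty ranks for seven structures (1 = hardest); not the paper's data.
human_ranks = [1, 2, 3, 4, 5, 6, 7]
model_ranks = [2, 1, 3, 4, 5, 7, 6]   # two adjacent pairs swapped

print(round(spearman_rho(human_ranks, model_ranks), 3))  # 0.929
```

Because only seven items are ranked, the coefficient moves in coarse steps, so small differences between models' correlations should be read with care.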
Models require moderate strength to replicate human target-baseline difference patterns:
- Too Weak: Poor performance on both conditions
- Too Strong: Good performance on both conditions
- Moderate: Shows directional differences similar to humans
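The three regimes above can be expressed as a simple classification rule. This is a sketch only: the `floor` and `ceiling` thresholds, the label strings, and the example accuracies are invented for illustration, and the summary does not state how the authors delimit the regimes.

```python
def classify(target_acc: float, baseline_acc: float,
             floor: float = 0.4, ceiling: float = 0.9) -> str:
    """Label a model by its target/baseline accuracies (thresholds are illustrative)."""
    if target_acc < floor and baseline_acc < floor:
        return "too weak"          # poor on both conditions
    if target_acc > ceiling and baseline_acc > ceiling:
        return "too strong"        # good on both conditions
    if baseline_acc > target_acc:
        return "human-like gap"    # baseline easier than target, as for humans
    return "other"


# Made-up accuracies (target, baseline) for three hypothetical models.
examples = {"small": (0.30, 0.35), "mid": (0.55, 0.85), "large": (0.95, 0.97)}
labels = {name: classify(t, b) for name, (t, b) in examples.items()}
print(labels)  # {'small': 'too weak', 'mid': 'human-like gap', 'large': 'too strong'}
```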
- Strength Dependency: Only sufficiently strong models benefit from chain-of-thought
- Structure Specificity: Chain-of-thought provides greater benefits for non-GP structures, with limited effects on GP structures
- Exception Cases: GPT-5 shows significant improvement on GP structures with chain-of-thought
- Brain Activation Comparison: Schrimpf et al. compare brain and LLM activation patterns
- Cognitive Metric Prediction: Using LLM information to predict human reading times, eye movements, etc.
- Garden Path Effects: Amouyal et al. found human-like errors in LLMs on specific GP sentences
- Center Embedding: Hu et al. showed LLMs, like humans, consider center-embedded sentences ungrammatical
This study is the first to systematically compare multiple linguistic phenomena within a unified framework, overcoming inconsistent experimental settings in previous research.
- Specificity of GP Structures: LLMs show performance closer to humans on GP sentences, possibly because GP sentences require discarding incorrect interpretations rather than merely relying on working memory
- Scale Effects: Larger models show higher correlation with humans in structure difficulty rankings
- Sweet Spot Principle: Moderately strong models best replicate human processing patterns
Working Memory Hypothesis: LLMs outperform humans on structures requiring substantial working memory (e.g., double embedding), but perform relatively worse on GP sentences requiring discarding incorrect interpretations, as the latter is not a working memory capacity issue.
- Model Coverage: Only tested one closed-source model family from OpenAI; lacks Anthropic or Google models
- GP Type Limitations: Did not test all types of garden path sentences
- Single Metric: Only tested comprehension accuracy; lacks eye-tracking, reading time, and other cognitive metrics
- Causal Verification: Design experiments to verify the working memory hypothesis
- Extended Testing: Include more model families and GP types
- Additional Metrics: Incorporate further cognitive measures, such as reading times and eye-tracking
- Rigorous Experimental Design: Systematic comparison within a unified framework with sufficient variable control
- Broad Scale: Covers 31 models and 7 linguistic phenomena within a single study
- Important Findings: The finding that LLMs behave differently on GP and non-GP structures holds significant theoretical value
- Methodological Innovation: Measures comprehension directly rather than through indirect metrics, which makes the comparison with humans more reliable
- Limited Theoretical Explanation: Working memory hypothesis still requires more supporting evidence
- Language Limitations: Only tested English; lacks cross-linguistic validation
- Single Task: Uses only Yes/No questions, which may not fully reflect comprehension ability
- Academic Contribution: Provides new methodological framework for human-AI cognitive comparison research
- Practical Value: Helps understand LLM language processing limitations, guiding model improvements
- Reproducibility: Authors commit to open-sourcing code and data, facilitating subsequent research
- Model Evaluation: Provides fine-grained evaluation tools for LLM language comprehension
- Cognitive Research: Provides paradigm for comparing language processing mechanisms between artificial and natural intelligence
- Educational Applications: Can be used for identifying difficult structures in language learning and targeted training
- Amouyal et al. (2025). When the LM misunderstood the human chuckled: Analyzing garden path effects in humans and language models.
- Christianson et al. (2001). Thematic roles assigned along the garden path linger.
- Gibson & Thomas (1999). Memory limitations and structural forgetting.
- Gordon et al. (2001). Memory interference during language processing.
Overall Assessment: This is a high-quality interdisciplinary study with methodological innovation, rigorous experimental design, and findings of significant theoretical and practical importance. Particularly, the discovery of differential processing between GP and non-GP structures provides new perspectives for understanding LLM cognitive mechanisms. Despite some limitations, the overall contribution is substantial and merits further in-depth investigation.