Exploring Compositional Generalization (in COGS/ReCOGS_pos) by Transformers using Restricted Access Sequence Processing (RASP)

Bruns

Humans understand new combinations of words encountered if they are combinations of words recognized from different contexts, an ability called Compositional Generalization. The COGS benchmark (Kim and Linzen, 2020) arXiv:2010.05465 reports 0% accuracy for Transformer models on some structural generalizations. We use (Weiss et al., 2021) arXiv:2106.06981's Restricted Access Sequence Processing (RASP), a Transformer-equivalent programming language, to demonstrate that a Transformer Encoder-Decoder can perform COGS and the semantically equivalent ReCOGS_pos (Wu et al., 2024) arXiv:2303.13716 systematically and compositionally: Our RASP models attain near perfect scores on structural generalization splits on COGS (exact match) and ReCOGS_pos (semantic exact match). Our RASP models show the (Re)COGS tasks do not require a hierarchical or tree-structured solution (contrary to (Kim and Linzen, 2020) arXiv:2010.05465, (Yao and Koller, 2022) arXiv:2210.13050, (Murty et al., 2022) arXiv:2211.01288, (Liu et al., 2021) arXiv:2107.06516): we use word-level tokens with an "embedding" layer that tags with possible part of speech, applying just once per encoder pass 19 attention-head compatible flat pattern-matching rules (easily identified with specific training examples), shown using grammar coverage (Zeller et al., 2023) to cover the non-recursive aspects of the input grammar, plus masking out prepositional phrases ("pp noun") and/or sentential complements (cp) when recognizing grammar patterns and extracting nouns related to the main verb in the sentence, and output the next logical form (LF) token (repeating until the LF is complete). The models do not apply recursive, tree-structured rules like "np_det pp np -> np_pp -> np", but score near perfect semantic and string exact match on both COGS and ReCOGS pp recursion, cp recursion using the decoder loop.

Basic Information

Paper ID: 2504.15349
Title: Exploring Compositional Generalization (in COGS/ReCOGS_pos) by Transformers using Restricted Access Sequence Processing (RASP)
Author: William Bruns
Category: cs.CL (Computation and Language)
Publication Date: October 14, 2025 (arXiv v3)
Paper Link: https://arxiv.org/abs/2504.15349v3

Abstract

Humans can understand novel combinations of words that they have previously encountered in other contexts, a capability termed compositional generalization. The COGS benchmark reports that Transformer models achieve 0% accuracy on certain structural generalization splits. This paper uses the RASP (Restricted Access Sequence Processing) language to demonstrate that a Transformer encoder-decoder can perform COGS and the semantically equivalent ReCOGS_pos task systematically and compositionally: the RASP models achieve near-perfect scores on the structural generalization splits. The work shows that the (Re)COGS tasks do not require a hierarchical or tree-structured solution. Instead, the RASP models apply 19 attention-head-compatible flat pattern-matching rules and mask out prepositional phrases and sentential complements when recognizing grammar patterns and extracting the nouns related to the main verb.

Research Background and Motivation

Problem Definition

The core problem is the poor compositional generalization of Transformer models, in particular their near-zero accuracy on the structural generalization splits of the COGS (Compositional Generalization Challenge based on Semantic Interpretation) benchmark.

Significance

Theoretical Significance: Compositional generalization is a core capability of human language understanding. Understanding how neural networks implement this capability is crucial for advancing AI's language comprehension.
Practical Significance: The near-0% accuracy of current Transformer models on structural generalization splits points to a systematic failure whose cause needs to be identified.

Limitations of Existing Approaches

Shallow Network Constraints: The 2-layer Encoder-Decoder used by Kim and Linzen (2020) performs extremely poorly on structural generalization.
Incorrect Hierarchical Assumptions: Existing research assumes that tree structures or hierarchical representations are necessary to solve COGS tasks.
Ineffectiveness of Depth: Petty et al. (2024) found that even increasing to 32 layers provides no improvement for Transformers on COGS structural generalization.

Research Motivation

The author was inspired by Zhou et al. (2023), who used RASP to analyze Transformer generalization capabilities. The goal is to demonstrate by construction that Transformers can in principle solve the COGS tasks, and to analyze why existing trained models fail.

Core Contributions

Constructive Proof: Uses RASP language to prove that Transformer Encoder-Decoder architectures can theoretically solve COGS and ReCOGS_pos tasks systematically.
Flat Solution: Proposes a non-hierarchical solution based on 19 flat pattern-matching rules without requiring recursive tree-structured rules.
Error Analysis: Predicts and validates specific error patterns in baseline Transformers through "attraction error" theory.
Performance Breakthrough: Over the generalization splits overall, the RASP model achieves 99.89% string exact match on COGS and 99.63% semantic exact match on ReCOGS_pos.
New Generalization Split: Discovers and validates a new challenging generalization split "v_dat_p2_pp_moved_to_recipient".

Methodology Details

Task Definition

COGS/ReCOGS tasks require translating sentences generated from a simplified English grammar into logical forms (LF):

Input: English sentence (e.g., "A scientist lended a cat a donut")
Output: Logical form (e.g., "scientist(1); cat(4); donut(6); lend(2) AND agent(2,1) AND recipient(2,4) AND theme(2,6)")
Evaluation: String exact match (COGS) or semantic exact match (ReCOGS)

Model Architecture

RASP Programming Framework

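RASP is a programming language designed to be Transformer-equivalent: its primitives correspond to operations that attention heads and feed-forward layers can perform. This paper uses it to construct an Encoder-Decoder model: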

Embedding Layer: Maps word-level tokens to part-of-speech and verb-type labels
Encoder: Uses 19 attention-head-compatible flat pattern matchers
Decoder Loop: Autoregressively generates logical form tokens

Core Component Design

1. Part-of-Speech Embedding Mapping

Vocabulary → {det: 1, common_noun: 7, proper_noun: 8, v_dat: 18, ...}

2. Flat Pattern Matchers

19 patterns cover all non-recursive grammatical rules, such as:

np v_dat_p2 np np (e.g., "Liam forwarded the girl the donut")
np was v_trans_omissible_pp_p2 by np (passive voice)
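
The tagging and flat matching steps above can be pictured with a small plain-Python sketch. This is not the paper's RASP program: the mini-lexicon, the helper names (tag, collapse_noun_phrases, match_pattern), and the idea of collapsing noun phrases into a single "np" symbol before comparison are assumptions made for illustration. Only the two pattern strings come from the text above.

```python
# Hypothetical mini-lexicon; the paper builds its mapping from the COGS training vocabulary.
LEXICON = {
    "the": "det", "a": "det",
    "girl": "common_noun", "donut": "common_noun",
    "Liam": "proper_noun",
    "forwarded": "v_dat_p2",
}

# Two of the flat patterns quoted in the text (the paper uses 19).
PATTERNS = {
    "np v_dat_p2 np np",
    "np was v_trans_omissible_pp_p2 by np",
}

def tag(sentence):
    """Map each word to its part-of-speech / verb-type label."""
    return [LEXICON[word] for word in sentence.split()]

def collapse_noun_phrases(tags):
    """Rewrite 'det common_noun' and 'proper_noun' as a single 'np' symbol."""
    out, i = [], 0
    while i < len(tags):
        if tags[i] == "det" and i + 1 < len(tags) and tags[i + 1] == "common_noun":
            out.append("np")
            i += 2
        elif tags[i] == "proper_noun":
            out.append("np")
            i += 1
        else:
            out.append(tags[i])
            i += 1
    return out

def match_pattern(sentence):
    """Return the matching flat pattern string, or None if no pattern matches."""
    flat = " ".join(collapse_noun_phrases(tag(sentence)))
    return flat if flat in PATTERNS else None

print(match_pattern("Liam forwarded the girl the donut"))
```

The point of the sketch is that the match is a single comparison over a flat tag sequence; no parse tree is built at any step.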

3. Masking Mechanism

Key innovation: Masks prepositional phrase nouns when extracting noun-verb relationships:

no_pp_np_mask = 1 - aggregate((pp_one_after_mask and np_prop_diag_mask) or 
                              (pp_two_after_mask and np_det_diag_mask), 1)

Technical Innovations

1. Non-Recursive Solution

Contrary to the assumption in prior work, the model does not use recursive rules such as np_det pp np → np_pp → np. Instead, it:

Identifies primary grammatical patterns in the encoder
Unfolds recursive structures in the decoder

2. Attraction Error Avoidance

Masking prevents nouns inside prepositional phrases from "attracting" grammatical relations that belong to another noun:

Error: The cake on the plate burned → theme(burn, plate)  # Attraction error
Correct: The cake on the plate burned → theme(burn, cake)   # After masking
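
A minimal plain-Python sketch of the attraction error and its fix follows. It assumes a reading of the mask in which a noun is hidden when it is a proper noun directly after a preposition, or a common noun two positions after a preposition with a determiner in between. The lexicon, the "nearest noun before the verb" heuristic, and all function names are illustrative assumptions, not the paper's RASP code.

```python
# Hypothetical mini-lexicon (illustrative only).
LEXICON = {
    "the": "det", "a": "det",
    "cake": "common_noun", "plate": "common_noun",
    "on": "pp", "burned": "verb",
}
NOUN_TAGS = {"common_noun", "proper_noun"}

def tag(words):
    """Look up each word, falling back to its lowercase form."""
    return [LEXICON[w] if w in LEXICON else LEXICON[w.lower()] for w in words]

def no_pp_np_mask(tags):
    """1 for positions that stay visible, 0 for nouns inside a prepositional phrase."""
    mask = [1] * len(tags)
    for i, t in enumerate(tags):
        prop_after_pp = t == "proper_noun" and i >= 1 and tags[i - 1] == "pp"
        det_noun_after_pp = (
            t == "common_noun" and i >= 2
            and tags[i - 1] == "det" and tags[i - 2] == "pp"
        )
        if prop_after_pp or det_noun_after_pp:
            mask[i] = 0
    return mask

def nearest_noun_before_verb(words, use_mask):
    """Pick the closest visible noun to the left of the verb (a flat, linear rule)."""
    tags = tag(words)
    mask = no_pp_np_mask(tags) if use_mask else [1] * len(tags)
    verb_idx = tags.index("verb")
    for i in range(verb_idx - 1, -1, -1):
        if tags[i] in NOUN_TAGS and mask[i]:
            return words[i]
    return None

words = "The cake on the plate burned".split()
print(nearest_noun_before_verb(words, use_mask=False))  # attraction error: plate
print(nearest_noun_before_verb(words, use_mask=True))   # after masking: cake
```

Without the mask, the linear rule attaches the verb to the noun inside the prepositional phrase; with the mask, that noun is invisible and the head noun is selected.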

3. Decoder Loop Unfolding

Recursive structures are processed through decoder loops, supporting arbitrary depth of prepositional phrase and clause nesting.
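
One way to picture loop-based unfolding is the sketch below, which emits one prepositional-phrase relation per step and decides what to emit next only from the input and from what has already been emitted. It is a simplification under stated assumptions: the real decoder produces one LF token at a time, and the lexicon, the relation tuple format, and the function names are hypothetical.

```python
# Hypothetical mini-lexicon (illustrative only).
LEXICON = {
    "the": "det", "cake": "common_noun", "plate": "common_noun",
    "house": "common_noun", "on": "pp", "in": "pp",
}
NOUN_TAGS = {"common_noun", "proper_noun"}

def next_relation(words, tags, emitted):
    """Return the next nmod relation, or None once every preposition is covered."""
    prep_positions = [i for i, t in enumerate(tags) if t == "pp"]
    if len(emitted) == len(prep_positions):
        return None
    p = prep_positions[len(emitted)]
    # Each preposition links the nearest noun on its left to the nearest noun on its right.
    head = max(i for i in range(p) if tags[i] in NOUN_TAGS)
    dependent = min(i for i in range(p + 1, len(tags)) if tags[i] in NOUN_TAGS)
    return (f"nmod.{words[p]}", head, dependent)

def decode(sentence):
    """Repeat the same flat step until nothing is left to emit."""
    words = sentence.split()
    tags = [LEXICON[w] for w in words]
    emitted = []
    while True:
        relation = next_relation(words, tags, emitted)
        if relation is None:
            return emitted
        emitted.append(relation)

print(decode("the cake on the plate in the house"))
```

Nesting depth only changes how many times the loop runs; the rule applied at each step stays the same and never refers to a tree.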

Experimental Setup

Datasets

COGS: 24,155 training examples, 3,000 test examples, 21,000 generalization examples
ReCOGS_pos: A ReCOGS variant that uses positional indices; it is semantically equivalent to COGS and is evaluated with semantic exact match
Grammar Coverage: Validates that 19 rules cover 100% of non-recursive grammar using Zeller et al. (2023) methodology

Evaluation Metrics

String Exact Match: Identical logical form strings
Semantic Exact Match: Semantically equivalent logical forms with potentially different indices and ordering
Grammar Coverage Rate: Proportion of total grammar supported by the model
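
The difference between the two exact-match metrics can be illustrated with a simplified checker. This is a sketch, not the official ReCOGS evaluator: it assumes the compact LF notation used in the Task Definition example, treats every argument as a renameable variable, and searches all renamings by brute force, which is only practical for short logical forms. The names parse_lf and semantic_match are hypothetical.

```python
import re
from itertools import permutations

# Matches atoms such as "cat(4)" or "agent(2,1)".
ATOM = re.compile(r"(\w[\w.]*)\s*\(([^)]*)\)")

def parse_lf(lf):
    """Parse an LF string into a set of (predicate, args) atoms, ignoring order."""
    return {
        (pred, tuple(arg.strip() for arg in args.split(",")))
        for pred, args in ATOM.findall(lf)
    }

def semantic_match(lf_a, lf_b):
    """True if the atom sets are equal under some one-to-one renaming of indices."""
    a, b = parse_lf(lf_a), parse_lf(lf_b)
    vars_a = sorted({x for _, args in a for x in args})
    vars_b = sorted({x for _, args in b for x in args})
    if len(a) != len(b) or len(vars_a) != len(vars_b):
        return False
    for perm in permutations(vars_b):
        rename = dict(zip(vars_a, perm))
        if {(p, tuple(rename[x] for x in args)) for p, args in a} == b:
            return True
    return False

gold = "scientist(1); cat(4); donut(6); lend(2) AND agent(2,1) AND recipient(2,4) AND theme(2,6)"
reordered = "cat(40); scientist(10); donut(60); lend(20) AND theme(20,60) AND agent(20,10) AND recipient(20,40)"
swapped = "scientist(1); cat(4); donut(6); lend(2) AND agent(2,6) AND recipient(2,4) AND theme(2,1)"

print(gold == reordered, semantic_match(gold, reordered))
print(semantic_match(gold, swapped))
```

The reordered form fails string exact match but passes the semantic check, while the form with agent and theme swapped fails both.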

Comparison Methods

Wu et al. (2024) Baseline: 2-layer Encoder-Decoder Transformer
Layer Variants: 3-layer and 4-layer versions
Data-Augmented Versions: Versions with added prepositional phrase modification examples

Implementation Details

Uses official RASP interpreter for program evaluation
Vocabulary mapping based on all words in COGS training set
Because the RASP programs are deterministic, their results are reported with Clopper-Pearson confidence intervals rather than a spread over random seeds
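
For readers who want to recompute such intervals, here is a standard-library sketch of the two-sided Clopper-Pearson (exact binomial) interval. It assumes a 95% level and solves for the bounds by bisection on the binomial CDF; the function names are illustrative and this is not the author's evaluation script.

```python
import math

def _log_binom_pmf(i, n, p):
    """log P(X = i) for X ~ Binomial(n, p), with 0 < p < 1."""
    return (
        math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
        + i * math.log(p) + (n - i) * math.log1p(-p)
    )

def _binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    if k < 0:
        return 0.0
    if k >= n or p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 0.0
    return min(1.0, sum(math.exp(_log_binom_pmf(i, n, p)) for i in range(k + 1)))

def _solve_decreasing(f, target, iterations=60):
    """Find p in (0, 1) with f(p) = target, where f is decreasing in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if f(mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(k, n, alpha=0.05):
    """Two-sided exact interval for k successes out of n trials."""
    # Lower bound: P(X >= k | p) = alpha / 2, i.e. CDF(k - 1) = 1 - alpha / 2.
    lower = 0.0 if k == 0 else _solve_decreasing(
        lambda p: _binom_cdf(k - 1, n, p), 1 - alpha / 2)
    # Upper bound: P(X <= k | p) = alpha / 2.
    upper = 1.0 if k == n else _solve_decreasing(
        lambda p: _binom_cdf(k, n, p), alpha / 2)
    return lower, upper

low, high = clopper_pearson(1000, 1000)
print(f"{100 * low:.2f}% to {100 * high:.2f}%")
```

When all 1000 examples of a split are correct, the lower bound has the closed form 0.025^(1/1000), roughly 0.9963, which is why a perfect score on a 1000-example split still carries an interval starting near 99.63%.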

Experimental Results

Main Results

RASP Model Performance

COGS (String Exact Match)

Test set: 99.97% (99.81-99.99%)
obj_pp_to_subj_pp: 100.00% (99.63-100.00%)
pp_recursion: 98.40% (97.41-99.08%)
cp_recursion: 99.90% (99.44-99.997%)
Overall generalization: 99.89% (99.83-99.93%)

ReCOGS_pos (Semantic Exact Match)

Test set: 100.00% (99.88-100.00%)
obj_pp_to_subj_pp: 92.20% (90.36-93.79%)
pp_recursion: 100.00% (99.63-100.00%)
cp_recursion: 100.00% (99.63-100.00%)
Overall generalization: 99.63% (99.54-99.71%)

Baseline Transformer Performance Comparison

Wu et al. (2024) Baseline (ReCOGS_pos)

pp_recursion: 40.2% ± 9.3%
cp_recursion: 52.4% ± 1.4%
obj_pp_to_subj_pp: 19.7% ± 6.1%

Attraction Error Analysis

Error analysis of baseline Transformer validates theoretical predictions:

96.73% of single-relation errors conform to attraction error patterns
100% of depth-2 prepositional phrase errors point to the nearest prepositional noun
Confirms non-hierarchical linear processing hypothesis

New Generalization Split Validation

"v_dat_p2_pp_moved_to_recipient" split:

Baseline performance: 13% ± 15.6% (comparable to most difficult splits)
Supports flat processing hypothesis over tree-structured hypothesis

Ineffectiveness of Increased Layers

Increasing the baseline Transformer from 2 layers to 3 or 4 layers provides no improvement on obj_pp_to_subj_pp, consistent with the findings of Petty et al. (2024).

Related Work

Compositional Generalization Research

COGS Benchmark: Proposed by Kim and Linzen (2020), reporting near-0% structural generalization accuracy for Transformers
ReCOGS Improvements: Wu et al. (2024) achieve non-zero but still low accuracy through semantic exact match
Hierarchical Approaches: Liu et al. (2021), Weißenhorn et al. (2022) achieve high performance using explicit tree structures

RASP Research

Original RASP: Weiss et al. (2021) for analyzing Transformer encoder capabilities
Decoder Extensions: Zhou et al. (2023) extend to autoregressive decoders, analyzing length generalization
Task-Specific Applications: This paper is the first to apply RASP to complex semantic parsing tasks

Attraction Error Research

Linguistic Foundation: Jespersen (1954) describes subject-verb agreement attraction errors
Neural Network Attraction: van Schijndel et al. (2019), Goldberg (2019) observe similar phenomena in Transformers

Conclusions and Discussion

Main Conclusions

Theoretical Feasibility: Transformers can theoretically solve COGS tasks through flat pattern matching without hierarchical representations.
Key Mechanism: Masking prepositional phrase nouns is crucial for avoiding attraction errors.
Learning Problem: Current Transformer failures are learning problems rather than capability limitations.
Error Predictability: Specific baseline model errors can be accurately predicted based on flat processing assumptions.

Limitations

Manual Construction: The RASP model is hand-designed rather than learned.
Vocabulary Constraints: Assumes part-of-speech and verb-type mappings are known and does not address vocabulary generalization.
Language-Specific: Focused on English; applicability to other languages remains unknown.
Task-Specific: Model is specifically designed for COGS, not a general language model.

Future Directions

Learning Algorithms: Research how to enable Transformers to learn similar masking rules.
Training Objectives: Explore data augmentation, curriculum learning, reinforcement learning, and other methods.
Architecture Improvements: Design better inductive biases to promote compositional generalization.
Multilingual Extension: Validate method effectiveness on other languages.

In-Depth Evaluation

Strengths

Theoretical Contribution: Clarifies the theoretical capability boundaries of Transformers through constructive proof.
Methodological Innovation: The proposed flat solution challenges assumptions about the necessity of hierarchical representations.
Empirical Rigor: Detailed error analysis and predictive validation strengthen conclusion credibility.
Engineering Completeness: Provides complete reproducible code and detailed implementation documentation.
Deep Insights: Attraction error theory provides new perspectives for understanding Transformer failures.

Weaknesses

Practical Limitations: RASP models run extremely slowly and are suitable only for research, not for practical applications.
Missing Learning: Does not address the core question of how to enable Transformers to automatically learn these rules.
Limited Evaluation Scope: Primarily focuses on structural generalization with insufficient attention to vocabulary generalization.
Strong Assumptions: The assumption that part-of-speech mappings are known may be unrealistic in practical applications.

Impact

Theoretical Impact: Provides new theoretical frameworks and analytical tools for compositional generalization research.
Methodological Impact: RASP analysis methods may be widely applied to other Transformer capability studies.
Practical Guidance: Provides specific technical directions for improving Transformer training.

Applicable Scenarios

Research Tool: As a theoretical tool for analyzing Transformer capabilities.
Benchmark Testing: Provides reference implementations for evaluating compositional generalization capabilities.
Educational Resource: Helps understand Transformer internal mechanisms.
Algorithm Design: Provides inspiration for designing better compositional generalization algorithms.

References

Kim, N., & Linzen, T. (2020). COGS: A compositional generalization challenge based on semantic interpretation. EMNLP 2020.
Wu, Z., Manning, C. D., & Potts, C. (2024). ReCOGS: How incidental details of a logical form overshadow an evaluation of semantic interpretation. TACL.
Weiss, G., Goldberg, Y., & Yahav, E. (2021). Thinking like transformers. NeurIPS 2021.
Zhou, H., et al. (2023). What algorithms can transformers learn? A study in length generalization. arXiv preprint.
Zeller, A., et al. (2023). Grammar coverage. In The Fuzzing Book.

Through rigorous theoretical analysis and empirical validation, this paper provides important insights into understanding Transformer capabilities and limitations on compositional generalization tasks. While it has certain practical limitations, its theoretical contributions and methodological innovations hold significant value for advancing related research.