Exploring Compositional Generalization (in COGS/ReCOGS_pos) by Transformers using Restricted Access Sequence Processing (RASP)

Bruns
Humans can understand new combinations of words when the words themselves are familiar from other contexts, an ability called compositional generalization. The COGS benchmark (Kim and Linzen, 2020; arXiv:2010.05465) reports 0% accuracy for Transformer models on some structural generalizations. We use Restricted Access Sequence Processing (RASP), a Transformer-equivalent programming language (Weiss et al., 2021; arXiv:2106.06981), to demonstrate that a Transformer Encoder-Decoder can perform COGS and the semantically equivalent ReCOGS_pos (Wu et al., 2024; arXiv:2303.13716) systematically and compositionally: our RASP models attain near-perfect scores on the structural generalization splits of COGS (exact match) and ReCOGS_pos (semantic exact match).

Our RASP models show that the (Re)COGS tasks do not require a hierarchical or tree-structured solution, contrary to Kim and Linzen (2020; arXiv:2010.05465), Yao and Koller (2022; arXiv:2210.13050), Murty et al. (2022; arXiv:2211.01288), and Liu et al. (2021; arXiv:2107.06516). We use word-level tokens with an "embedding" layer that tags each token with its possible parts of speech, and apply, just once per encoder pass, 19 attention-head-compatible flat pattern-matching rules (each easily identified with specific training examples). Grammar coverage (Zeller et al., 2023) shows that these rules cover the non-recursive aspects of the input grammar. The models additionally mask out prepositional phrases ("pp noun") and/or sentential complements (cp) when recognizing grammar patterns and extracting nouns related to the main verb of the sentence, and then output the next logical form (LF) token, repeating until the LF is complete. The models do not apply recursive, tree-structured rules like "np_det pp np -> np_pp -> np", yet by using the decoder loop they score near-perfect semantic and string exact match on the pp recursion and cp recursion splits of both COGS and ReCOGS.

Basic Information

  • Paper ID: 2504.15349
  • Title: Exploring Compositional Generalization (in COGS/ReCOGS_pos) by Transformers using Restricted Access Sequence Processing (RASP)
  • Author: William Bruns
  • Category: cs.CL (Computation and Language)
  • Publication Date: October 14, 2025 (arXiv v3)
  • Paper Link: https://arxiv.org/abs/2504.15349v3

Abstract

Humans can understand novel combinations of words they have previously encountered in other contexts, a capability termed compositional generalization. The COGS benchmark reports that Transformer models achieve 0% accuracy on certain structural generalization tasks. This paper uses the RASP (Restricted Access Sequence Processing) language to demonstrate that Transformer encoder-decoder architectures can systematically and compositionally perform COGS and the semantically equivalent ReCOGS_pos task: the RASP models achieve near-perfect scores on the structural generalization splits. The research shows that (Re)COGS tasks do not require hierarchical or tree-structured solutions; instead, the RASP models apply 19 attention-head-compatible flat pattern-matching rules and mask out prepositional phrases and sentential complements when recognizing grammatical patterns.

Research Background and Motivation

Problem Definition

The core problem addressed in this research is the limitation of Transformer models' capabilities on compositional generalization tasks, particularly their poor performance on the COGS (Compositional Generalization Challenge based on Semantic Interpretation) benchmark.

Significance

  1. Theoretical Significance: Compositional generalization is a core capability of human language understanding. Understanding how neural networks implement this capability is crucial for advancing AI's language comprehension.
  2. Practical Significance: The near-0% accuracy of current Transformer models on structural generalization tasks indicates fundamental limitations that require solutions.

Limitations of Existing Approaches

  1. Shallow Network Constraints: The 2-layer Encoder-Decoder used by Kim and Linzen (2020) performs extremely poorly on structural generalization.
  2. Incorrect Hierarchical Assumptions: Existing research assumes that tree structures or hierarchical representations are necessary to solve COGS tasks.
  3. Ineffectiveness of Depth: Petty et al. (2024) found that even increasing to 32 layers provides no improvement for Transformers on COGS structural generalization.

Research Motivation

The author was inspired by Zhou et al. (2023), who used RASP to analyze Transformer generalization capabilities. The goal is to demonstrate through constructive proof that Transformers can theoretically solve COGS tasks and analyze why existing models fail.

Core Contributions

  1. Constructive Proof: Uses RASP language to prove that Transformer Encoder-Decoder architectures can theoretically solve COGS and ReCOGS_pos tasks systematically.
  2. Flat Solution: Proposes a non-hierarchical solution based on 19 flat pattern-matching rules without requiring recursive tree-structured rules.
  3. Error Analysis: Predicts and validates specific error patterns in baseline Transformers through "attraction error" theory.
  4. Performance Breakthrough: The RASP models achieve 99.89% string exact match on COGS and 99.63% semantic exact match on ReCOGS_pos, measured over the full generalization sets.
  5. New Generalization Split: Discovers and validates a new challenging generalization split "v_dat_p2_pp_moved_to_recipient".

Methodology Details

Task Definition

COGS/ReCOGS tasks require converting sentences with simplified English grammar into logical forms (LF):

  • Input: English sentence (e.g., "A scientist lended a cat a donut")
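  • Output: Logical form, shown here in ReCOGS_pos style where each index is the 0-based position of the word in the sentence (e.g., "scientist(1); cat(4); donut(6); lend(2) AND agent(2,1) AND recipient(2,4) AND theme(2,6)")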
  • Evaluation: String exact match (COGS) or semantic exact match (ReCOGS)

Model Architecture

RASP Programming Framework

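RASP is a programming language whose primitives correspond to Transformer operations, so a RASP program describes a computation that a Transformer can in principle carry out. This paper uses it to construct an Encoder-Decoder model: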

  1. Embedding Layer: Maps word-level tokens to part-of-speech and verb-type labels
  2. Encoder: Uses 19 attention-head-compatible flat pattern matchers
  3. Decoder Loop: Autoregressively generates logical form tokens

Core Component Design

1. Part-of-Speech Embedding Mapping

Vocabulary → {det: 1, common_noun: 7, proper_noun: 8, v_dat: 18, ...}
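The lookup can be pictured as a plain dictionary from words to tag IDs. The following is a minimal sketch, not the paper's RASP code: only the four tag IDs come from the mapping above, while the word list and the `tag_tokens` helper are hypothetical. It also assigns a single tag per word, whereas the paper's embedding tags each token with its possible parts of speech.

```python
# Tag IDs taken from the mapping above; every other name here is illustrative.
TAG_IDS = {"det": 1, "common_noun": 7, "proper_noun": 8, "v_dat": 18}

# Hypothetical mini-vocabulary (the real mapping covers the COGS training set).
WORD_TAGS = {
    "a": "det", "the": "det",
    "girl": "common_noun", "donut": "common_noun",
    "liam": "proper_noun",
    "forwarded": "v_dat",
}

def tag_tokens(sentence: str) -> list[int]:
    """Map each whitespace-separated word to its part-of-speech tag ID."""
    return [TAG_IDS[WORD_TAGS[w.lower()]] for w in sentence.split()]

print(tag_tokens("Liam forwarded the girl the donut"))  # [8, 18, 1, 7, 1, 7]
```

Because the tagging is a fixed table, it needs no context and can sit in the embedding layer, leaving the attention heads free for pattern matching.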

2. Flat Pattern Matchers

19 patterns cover all non-recursive grammatical rules, such as:

  • np v_dat_p2 np np (e.g., "Liam forwarded the girl the donut")
  • np was v_trans_omissible_pp_p2 by np (passive voice)
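To illustrate what "flat" means here, the sketch below matches a tag sequence against a template by enumerating the template's finite set of expansions, with no recursive rewriting. It is a plain-Python analogy rather than the paper's attention-head implementation. The definition of `np` as either "det common_noun" or a bare proper noun is an assumption consistent with the examples above, and the symbols are treated as opaque tag names.

```python
from itertools import product

# Assumed noun-phrase forms: "det common_noun" or a bare proper noun.
NP_FORMS = [("det", "common_noun"), ("proper_noun",)]

def expand(template: str) -> set[tuple[str, ...]]:
    """Expand every 'np' slot into its flat tag sequences (no recursion)."""
    slots = [NP_FORMS if sym == "np" else [(sym,)] for sym in template.split()]
    return {sum(choice, ()) for choice in product(*slots)}

def matches(tags: list[str], template: str) -> bool:
    """True if the full tag sequence equals one expansion of the template."""
    return tuple(tags) in expand(template)

# Tags for "Liam forwarded the girl the donut"
tags = ["proper_noun", "v_dat_p2", "det", "common_noun", "det", "common_noun"]
print(matches(tags, "np v_dat_p2 np np"))  # True
print(matches(tags, "np v_dat_p2 np"))     # False
```

Each template expands to a small, fixed number of sequences (eight for the three-`np` template), which is why a fixed set of matchers can cover the non-recursive grammar.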

3. Masking Mechanism

Key innovation: masks prepositional phrase nouns when extracting noun-verb relationships:

no_pp_np_mask = 1 - aggregate((pp_one_after_mask and np_prop_diag_mask) or 
                              (pp_two_after_mask and np_det_diag_mask), 1)

Technical Innovations

1. Non-Recursive Solution

Unlike traditional assumptions, the model does not use recursive rules such as np_det pp np → np_pp → np. Instead, it:

  • Identifies primary grammatical patterns in the encoder
  • Unfolds recursive structures in the decoder

2. Attraction Error Avoidance

Prevents nouns in prepositional phrases from "attracting" incorrect grammatical relationships through masking:

Error: The cake on the plate burned → theme(burn, plate)  # Attraction error
Correct: The cake on the plate burned → theme(burn, cake)   # After masking
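The effect of the mask can be reproduced in a few lines. This is a hedged sketch: the tag names `prep` and `verb` are hypothetical, and the "nearest noun to the left of the verb" heuristic is only a stand-in for the baseline's attraction behaviour, not the paper's RASP program. The mask follows the rule described earlier: hide a proper noun one position after a preposition, or a common noun two positions after one.

```python
def pp_noun_mask(tags: list[str]) -> list[int]:
    """Return 1 for tokens kept and 0 for nouns inside a prepositional phrase."""
    keep = [1] * len(tags)
    for i, tag in enumerate(tags):
        # Proper noun directly after a preposition, e.g. "beside Emma".
        if tag == "proper_noun" and i >= 1 and tags[i - 1] == "prep":
            keep[i] = 0
        # Common noun two positions after a preposition, e.g. "on the plate".
        if (tag == "common_noun" and i >= 2
                and tags[i - 1] == "det" and tags[i - 2] == "prep"):
            keep[i] = 0
    return keep

def nearest_noun_before(verb_idx, tags, keep=None):
    """Index of the closest noun left of the verb, skipping masked nouns if given."""
    for i in range(verb_idx - 1, -1, -1):
        if tags[i] in ("common_noun", "proper_noun") and (keep is None or keep[i]):
            return i
    return None

words = "The cake on the plate burned".split()
tags = ["det", "common_noun", "prep", "det", "common_noun", "verb"]
keep = pp_noun_mask(tags)

print(words[nearest_noun_before(5, tags)])        # plate (attraction error)
print(words[nearest_noun_before(5, tags, keep)])  # cake (after masking)
```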

3. Decoder Loop Unfolding

Recursive structures are processed through decoder loops, supporting arbitrary depth of prepositional phrase and clause nesting.
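A loop that emits one relation per step is enough to handle nesting of any depth, as the following sketch suggests. It is an illustration under stated assumptions, not the paper's decoder: the `nmod.<preposition>` relation naming, the `prep` tag, and the rule that each prepositional phrase attaches to the immediately preceding noun are all assumed here.

```python
def next_pp_relation(words, tags, n_emitted):
    """Return the next nmod relation, or None once every preposition is covered."""
    prep_positions = [i for i, t in enumerate(tags) if t == "prep"]
    if n_emitted >= len(prep_positions):
        return None
    p = prep_positions[n_emitted]
    # Head: closest noun to the left. Dependent: closest noun to the right.
    head = max(i for i in range(p) if tags[i].endswith("noun"))
    dep = min(i for i in range(p + 1, len(tags)) if tags[i].endswith("noun"))
    return f"nmod.{words[p]}({head},{dep})"

def decode(words, tags):
    """Decoder-style loop: each step sees the input plus what was already emitted."""
    out = []
    while (rel := next_pp_relation(words, tags, len(out))) is not None:
        out.append(rel)
    return " AND ".join(out)

words = "the cake on the plate on the table".split()
tags = ["det", "common_noun", "prep", "det", "common_noun",
        "prep", "det", "common_noun"]
print(decode(words, tags))  # nmod.on(1,4) AND nmod.on(4,7)
```

Depth is bounded only by how many times the loop runs, so no recursive rule such as np_det pp np → np_pp → np is needed.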

Experimental Setup

Datasets

  • COGS: 24,155 training examples, 3,000 test examples, 21,000 generalization examples
  • ReCOGS_pos: ReCOGS version using position indices, semantically equivalent but allowing semantic exact match
  • Grammar Coverage: Validates that 19 rules cover 100% of non-recursive grammar using Zeller et al. (2023) methodology

Evaluation Metrics

  • String Exact Match: Identical logical form strings
  • Semantic Exact Match: Semantically equivalent logical forms with potentially different indices and ordering
  • Grammar Coverage Rate: Proportion of total grammar supported by the model

Comparison Methods

  • Wu et al. (2024) Baseline: 2-layer Encoder-Decoder Transformer
  • Layer Variants: 3-layer and 4-layer versions
  • Data-Augmented Versions: Versions with added prepositional phrase modification examples

Implementation Details

  • Uses official RASP interpreter for program evaluation
  • Vocabulary mapping based on all words in COGS training set
  • RASP programs are deterministic, so their accuracies are reported with Clopper-Pearson (exact binomial) confidence intervals
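For readers who want to check such intervals, the sketch below computes a Clopper-Pearson interval using only the standard library. Two things are assumptions here because the summary does not state them: the two-sided 95% level and a size of 1,000 examples per generalization split. Under those assumptions, a split with every example correct yields a lower bound of about 99.63%.

```python
import math

def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p), summed in log space to avoid overflow."""
    if p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 1.0 if k >= n else 0.0
    total = 0.0
    for i in range(k + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1)
                    - math.lgamma(n - i + 1)
                    + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_term)
    return min(total, 1.0)

def solve_increasing(f, target: float, iters: int = 60) -> float:
    """Bisection on [0, 1] for f(p) = target, where f increases with p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(x: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Exact binomial interval for x successes out of n trials."""
    # Lower bound: P(X >= x | p) = alpha / 2.
    lower = 0.0 if x == 0 else solve_increasing(
        lambda p: 1.0 - binom_cdf(x - 1, n, p), alpha / 2)
    # Upper bound: P(X <= x | p) = alpha / 2, i.e. P(X > x | p) = 1 - alpha / 2.
    upper = 1.0 if x == n else solve_increasing(
        lambda p: 1.0 - binom_cdf(x, n, p), 1.0 - alpha / 2)
    return lower, upper

lo, hi = clopper_pearson(1000, 1000)
print(f"{lo:.4f} {hi:.4f}")  # 0.9963 1.0000
```

When all n examples are correct, the lower bound has the closed form (alpha/2)^(1/n), which gives a quick sanity check on the bisection.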

Experimental Results

Main Results

RASP Model Performance

COGS (String Exact Match)

  • Test set: 99.97% (99.81-99.99%)
  • obj_pp_to_subj_pp: 100.00% (99.63-100.00%)
  • pp_recursion: 98.40% (97.41-99.08%)
  • cp_recursion: 99.90% (99.44-99.997%)
  • Overall generalization: 99.89% (99.83-99.93%)

ReCOGS_pos (Semantic Exact Match)

  • Test set: 100.00% (99.88-100.00%)
  • obj_pp_to_subj_pp: 92.20% (90.36-93.79%)
  • pp_recursion: 100.00% (99.63-100.00%)
  • cp_recursion: 100.00% (99.63-100.00%)
  • Overall generalization: 99.63% (99.54-99.71%)

Baseline Transformer Performance Comparison

Wu et al. (2024) Baseline (ReCOGS_pos)

  • pp_recursion: 40.2% ± 9.3%
  • cp_recursion: 52.4% ± 1.4%
  • obj_pp_to_subj_pp: 19.7% ± 6.1%

Attraction Error Analysis

Error analysis of baseline Transformer validates theoretical predictions:

  • 96.73% of single-relation errors conform to attraction error patterns
  • 100% of depth-2 prepositional phrase errors point to the nearest prepositional noun
  • These results are consistent with the hypothesis that the baseline processes the input linearly rather than hierarchically

New Generalization Split Validation

"v_dat_p2_pp_moved_to_recipient" split:

  • Baseline performance: 13% ± 15.6% (comparable to most difficult splits)
  • Supports flat processing hypothesis over tree-structured hypothesis

Ineffectiveness of Increased Layers

Increasing the baseline Transformer from 2 to 3 or 4 layers provides no improvement on obj_pp_to_subj_pp, consistent with the findings of Petty et al. (2024).

Related Work

Compositional Generalization Research

  • COGS Benchmark: Proposed by Kim and Linzen (2020), reporting near-0% structural generalization accuracy for Transformers
  • ReCOGS Improvements: Wu et al. (2024) achieve non-zero but still low accuracy through semantic exact match
  • Hierarchical Approaches: Liu et al. (2021), Weißenhorn et al. (2022) achieve high performance using explicit tree structures

RASP Research

  • Original RASP: Weiss et al. (2021) introduce RASP for analyzing Transformer encoder capabilities
  • Decoder Extensions: Zhou et al. (2023) extend to autoregressive decoders, analyzing length generalization
  • Task-Specific Applications: This paper is the first to apply RASP to complex semantic parsing tasks

Attraction Error Research

  • Linguistic Foundation: Jespersen (1954) describes subject-verb agreement attraction errors
  • Neural Network Attraction: van Schijndel et al. (2019), Goldberg (2019) observe similar phenomena in Transformers

Conclusions and Discussion

Main Conclusions

  1. Theoretical Feasibility: Transformers can theoretically solve COGS tasks through flat pattern matching without hierarchical representations.
  2. Key Mechanism: Masking prepositional phrase nouns is crucial for avoiding attraction errors.
  3. Learning Problem: Current Transformer failures are learning problems rather than capability limitations.
  4. Error Predictability: Specific baseline model errors can be accurately predicted based on flat processing assumptions.

Limitations

  1. Manual Construction: The RASP model is hand-designed rather than learned.
  2. Vocabulary Constraints: Assumes part-of-speech and verb-type mappings are known, and does not address vocabulary generalization.
  3. Language-Specific: Focused on English; applicability to other languages remains unknown.
  4. Task-Specific: Model is specifically designed for COGS, not a general language model.

Future Directions

  1. Learning Algorithms: Research how to enable Transformers to learn similar masking rules.
  2. Training Objectives: Explore data augmentation, curriculum learning, reinforcement learning, and other methods.
  3. Architecture Improvements: Design better inductive biases to promote compositional generalization.
  4. Multilingual Extension: Validate method effectiveness on other languages.

In-Depth Evaluation

Strengths

  1. Theoretical Contribution: Clarifies the theoretical capability boundaries of Transformers through constructive proof.
  2. Methodological Innovation: The proposed flat solution challenges assumptions about the necessity of hierarchical representations.
  3. Empirical Rigor: Detailed error analysis and predictive validation strengthen conclusion credibility.
  4. Engineering Completeness: Provides complete reproducible code and detailed implementation documentation.
  5. Deep Insights: Attraction error theory provides new perspectives for understanding Transformer failures.

Weaknesses

  1. Practical Limitations: RASP models run extremely slowly, suitable only for research rather than practical applications.
  2. Missing Learning: Does not address the core question of how to enable Transformers to automatically learn these rules.
  3. Limited Evaluation Scope: Primarily focuses on structural generalization with insufficient attention to vocabulary generalization.
  4. Strong Assumptions: The assumption that part-of-speech mappings are known may be unrealistic in practical applications.

Impact

  1. Theoretical Impact: Provides new theoretical frameworks and analytical tools for compositional generalization research.
  2. Methodological Impact: RASP analysis methods may be widely applied to other Transformer capability studies.
  3. Practical Guidance: Provides specific technical directions for improving Transformer training.

Applicable Scenarios

  1. Research Tool: As a theoretical tool for analyzing Transformer capabilities.
  2. Benchmark Testing: Provides reference implementations for evaluating compositional generalization capabilities.
  3. Educational Resource: Helps understand Transformer internal mechanisms.
  4. Algorithm Design: Provides inspiration for designing better compositional generalization algorithms.

References

  1. Kim, N., & Linzen, T. (2020). COGS: A compositional generalization challenge based on semantic interpretation. EMNLP 2020.
  2. Wu, Z., Manning, C. D., & Potts, C. (2024). ReCOGS: How incidental details of a logical form overshadow an evaluation of semantic interpretation. TACL.
  3. Weiss, G., Goldberg, Y., & Yahav, E. (2021). Thinking like transformers. NeurIPS 2021.
  4. Zhou, H., et al. (2023). What algorithms can transformers learn? A study in length generalization. arXiv preprint.
  5. Zeller, A., et al. (2023). Grammar coverage. In The Fuzzing Book.

Through rigorous theoretical analysis and empirical validation, this paper provides important insights into understanding Transformer capabilities and limitations on compositional generalization tasks. While it has certain practical limitations, its theoretical contributions and methodological innovations hold significant value for advancing related research.