2025-11-24T22:34:17.172236

Exploring Compositional Generalization (in COGS/ReCOGS_pos) by Transformers using Restricted Access Sequence Processing (RASP)

Bruns

Humans can understand new combinations of words when the individual words are familiar from other contexts, an ability called compositional generalization. The COGS benchmark (Kim and Linzen, 2020; arXiv:2010.05465) reports 0% accuracy for Transformer models on some structural generalizations. We use Restricted Access Sequence Processing (RASP; Weiss et al., 2021; arXiv:2106.06981), a Transformer-equivalent programming language, to demonstrate that a Transformer Encoder-Decoder can perform COGS and the semantically equivalent ReCOGS_pos (Wu et al., 2024; arXiv:2303.13716) systematically and compositionally: our RASP models attain near-perfect scores on the structural generalization splits of COGS (exact match) and ReCOGS_pos (semantic exact match). Our RASP models show that the (Re)COGS tasks do not require a hierarchical or tree-structured solution, contrary to Kim and Linzen (2020; arXiv:2010.05465), Yao and Koller (2022; arXiv:2210.13050), Murty et al. (2022; arXiv:2211.01288), and Liu et al. (2021; arXiv:2107.06516). We use word-level tokens with an "embedding" layer that tags each word with its possible parts of speech, and apply, just once per encoder pass, 19 attention-head-compatible flat pattern-matching rules (each easily identified with specific training examples). Grammar coverage (Zeller et al., 2023) shows that these rules cover the non-recursive aspects of the input grammar. In addition, the models mask out prepositional phrases ("pp noun") and/or sentential complements (cp) when recognizing grammar patterns and extracting the nouns related to the main verb of the sentence, and then output the next logical form (LF) token, repeating until the LF is complete. The models do not apply recursive, tree-structured rules like "np_det pp np -> np_pp -> np", yet, using the decoder loop, they score near-perfect semantic and string exact match on pp recursion and cp recursion in both COGS and ReCOGS.

Basic Information

Paper ID: 2504.15349
Title: Exploring Compositional Generalization (in COGS/ReCOGS_pos) by Transformers using Restricted Access Sequence Processing (RASP)
Author: William Bruns
Category: cs.CL (Computation and Language)
Published: October 14, 2025 (arXiv v3)
Paper link: https://arxiv.org/abs/2504.15349v3

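Abstract

Humans can understand new combinations made up of words they have recognized in different contexts, an ability called compositional generalization. The COGS benchmark reports 0% accuracy for Transformer models on some structural generalizations. This paper uses the RASP (Restricted Access Sequence Processing) language to show that a Transformer Encoder-Decoder can perform COGS and the semantically equivalent ReCOGS_pos task systematically and compositionally: the RASP models achieve near-perfect scores on the structural generalization splits. The study shows that the (Re)COGS tasks do not require a hierarchical or tree-structured solution; instead, the models use 19 attention-head-compatible flat pattern-matching rules and mask out prepositional phrases and sentential complements when recognizing grammar patterns.

Theoretical significance: Compositional generalization is a core capability of human language understanding, and understanding how neural networks can achieve it is essential for advancing language understanding in AI.
Practical significance: Current Transformer models score close to 0% on structural generalization tasks, which points to a fundamental limitation that calls for a solution.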

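Limitations of Existing Methods

Shallow-network limitation: The 2-layer Encoder-Decoder used by Kim and Linzen (2020) performs very poorly on structural generalization.
Mistaken hierarchy assumption: Existing work assumes that tree-structured or hierarchical representations are required to solve the COGS task.
Depth does not help: Petty et al. (2024) found that even with up to 32 layers, Transformers show no improvement on COGS structural generalization.

Research Motivation

Inspired by Zhou et al. (2023), who used RASP to analyze the generalization ability of Transformers, the author sets out to show by constructive proof that a Transformer can in principle solve the COGS task, and to analyze why existing models fail.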

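Core Contributions

Constructive proof: Uses the RASP language to show that a Transformer Encoder-Decoder can in principle solve the COGS and ReCOGS_pos tasks systematically.
Flat solution: Proposes a non-hierarchical solution based on 19 flat pattern-matching rules, with no recursive tree-structured rules.
Error analysis: Uses the theory of "attraction errors" to predict, and then verify, the specific error patterns of baseline Transformers.
Performance: Across the generalization splits overall, the RASP models reach 99.89% string exact match on COGS and 99.63% semantic exact match on ReCOGS_pos.
New generalization split: Identifies and validates a new, difficult generalization split, "v_dat_p2_pp_moved_to_recipient".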

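Method Details

Task Definition

The COGS/ReCOGS task requires translating sentences from a simplified English grammar into logical forms (LFs):

Input: an English sentence (e.g., "A scientist lended a cat a donut")
Output: a logical form (e.g., "scientist(1); cat(4); donut(6); lend(2) AND agent(2,1) AND recipient(2,4) AND theme(2,6)"), where each index is the 0-based position of the word in the input
Evaluation: string exact match (COGS) or semantic exact match (ReCOGS)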

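Model Architecture

RASP Programming Framework

RASP is a programming language that can be compiled to Transformer weights; the paper uses it to build an Encoder-Decoder model:

Embedding layer: maps word-level tokens to part-of-speech and verb-type tags
Encoder: uses 19 attention-head-compatible flat pattern matchers
Decoder loop: generates logical form tokens autoregressively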

Core Component Design

1. Part-of-speech embedding map

word → {det: 1, common_noun: 7, proper_noun: 8, v_dat: 18, ...}
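
As a rough illustration of the embedding step, the lookup can be sketched in Python as a dictionary from words to tag IDs. The four tag IDs are the ones listed above; the mini-lexicon and the function name `tag_sentence` are assumptions made for this example and are not the paper's actual vocabulary map. The sketch also assigns a single tag per word, whereas the paper's embedding tags each word with its possible parts of speech.

```python
# Tag IDs as listed in the summary; all other tags are omitted here.
TAG_IDS = {"det": 1, "common_noun": 7, "proper_noun": 8, "v_dat": 18}

# Hypothetical mini-lexicon for illustration only (word -> tag name).
LEXICON = {
    "a": "det",
    "the": "det",
    "scientist": "common_noun",
    "cat": "common_noun",
    "donut": "common_noun",
    "girl": "common_noun",
    "liam": "proper_noun",
    "lended": "v_dat",
    "forwarded": "v_dat",
}

def tag_sentence(sentence):
    """Map each word-level token to its tag ID (0 for unknown words)."""
    tokens = sentence.lower().split()
    return [TAG_IDS.get(LEXICON.get(tok, ""), 0) for tok in tokens]

print(tag_sentence("A scientist lended a cat a donut"))
# [1, 7, 18, 1, 7, 1, 7]
```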

2. Flat pattern matchers: 19 patterns cover all non-recursive grammar rules, for example:

np v_dat_p2 np np (e.g., "Liam forwarded the girl the donut")
np was v_trans_omissible_pp_p2 by np (passive voice)
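
To make "flat" concrete, the following Python sketch matches a tag sequence against the two templates above without building a tree: noun phrases are collapsed in a single left-to-right pass and the result is compared to a fixed template. The names `chunk_nps`, `match_flat`, and `TEMPLATES` are hypothetical, the role assignment for the passive template is an assumption based on standard COGS semantics, and the paper's matchers are written as attention-head-compatible RASP operations rather than Python loops.

```python
def chunk_nps(tags):
    """Collapse 'det common_noun' and 'proper_noun' into flat 'np' chunks.

    Returns (labels, heads): the chunk labels and the token index of each
    chunk's head word.
    """
    labels, heads = [], []
    i = 0
    while i < len(tags):
        if tags[i] == "det" and i + 1 < len(tags) and tags[i + 1] == "common_noun":
            labels.append("np")
            heads.append(i + 1)
            i += 2
        elif tags[i] == "proper_noun":
            labels.append("np")
            heads.append(i)
            i += 1
        else:
            labels.append(tags[i])
            heads.append(i)
            i += 1
    return labels, heads

# Flat template -> role of each chunk (None means the chunk gets no role).
TEMPLATES = {
    ("np", "v_dat_p2", "np", "np"): ("agent", "verb", "recipient", "theme"),
    ("np", "was", "v_trans_omissible_pp_p2", "by", "np"): ("theme", None, "verb", None, "agent"),
}

def match_flat(tags):
    """Return {role: token index} if the tag sequence fits a template, else None."""
    labels, heads = chunk_nps(tags)
    roles = TEMPLATES.get(tuple(labels))
    if roles is None:
        return None
    return {role: head for role, head in zip(roles, heads) if role is not None}

# "Liam forwarded the girl the donut"
tags = ["proper_noun", "v_dat_p2", "det", "common_noun", "det", "common_noun"]
print(match_flat(tags))
# {'agent': 0, 'verb': 1, 'recipient': 3, 'theme': 5}
```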

3. Masking mechanism. Key innovation: mask out prepositional-phrase nouns when extracting noun-verb relations:

no_pp_np_mask = 1 - aggregate((pp_one_after_mask and np_prop_diag_mask) or 
                              (pp_two_after_mask and np_det_diag_mask), 1)
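
One possible reading of the expression above, written as plain Python rather than RASP, is sketched below. It assumes that `pp_one_after_mask` and `pp_two_after_mask` mark positions one and two tokens after a preposition, that a masked noun is either a proper noun directly after a preposition or the common noun of a "det common_noun" pair after a preposition, and that prepositions carry the tag `pp`. The tag `v_unacc` and the function name are likewise illustrative assumptions.

```python
def no_pp_np_mask(tags):
    """Return 1 for positions to keep and 0 for nouns inside a prepositional phrase.

    Assumed reading of the RASP expression:
      - a proper noun directly after a preposition is masked
      - a common noun two tokens after a preposition ("pp det common_noun") is masked
    """
    n = len(tags)
    is_pp = [t == "pp" for t in tags]
    # Shifted copies of the preposition indicator (False near the start).
    pp_one_after = [i >= 1 and is_pp[i - 1] for i in range(n)]
    pp_two_after = [i >= 2 and is_pp[i - 2] for i in range(n)]
    masked = [
        (pp_one_after[i] and tags[i] == "proper_noun")
        or (pp_two_after[i] and tags[i] == "common_noun" and tags[i - 1] == "det")
        for i in range(n)
    ]
    return [0 if m else 1 for m in masked]

# "The cake on the plate burned"
tags = ["det", "common_noun", "pp", "det", "common_noun", "v_unacc"]
print(no_pp_np_mask(tags))
# [1, 1, 1, 1, 0, 1]
```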

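Technical Innovations

1. Non-recursive solution

Contrary to the conventional assumption, the model does not use recursive rules such as np_det pp np → np_pp → np. Instead, it:

identifies the main grammar pattern in the encoder
unrolls recursive structure in the decoder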
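
The idea that a loop can stand in for a recursive rule can be illustrated with a small Python sketch. It is only meant to show that a flat loop can enumerate a right-branching chain of prepositional phrases one relation at a time; it is not the paper's decoder, which emits one LF token per step. The function name, the `pp` tag, and the compact `nmod.<prep>(i,j)` output format are assumptions for this example.

```python
def unroll_pp_chain(tokens, tags):
    """Emit one nmod relation per preposition with a flat loop (no recursion)."""
    noun_positions = [i for i, t in enumerate(tags) if t in ("common_noun", "proper_noun")]
    relations = []
    for i, tag in enumerate(tags):
        if tag != "pp":
            continue
        left = max(p for p in noun_positions if p < i)   # noun being modified
        right = min(p for p in noun_positions if p > i)  # noun inside the pp
        relations.append(f"nmod.{tokens[i]}({left},{right})")
    return relations

tokens = "the cake on the plate beside the table".split()
tags = ["det", "common_noun", "pp", "det", "common_noun", "pp", "det", "common_noun"]
print(unroll_pp_chain(tokens, tags))
# ['nmod.on(1,4)', 'nmod.beside(4,7)']
```

Adding another prepositional phrase to the input adds one more loop iteration and one more relation, so the depth of nesting never has to be represented as a tree.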

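2. Avoiding attraction errors

The masking mechanism prevents nouns inside prepositional phrases from "attracting" the wrong grammatical relation:

Wrong: The cake on the plate burned → theme(burn, plate)  # attraction error
Correct: The cake on the plate burned → theme(burn, cake)   # after masking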
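
A toy model of this contrast is sketched below in Python. It assumes a deliberately simple heuristic, "pick the nearest noun to the left of the verb", and shows how the choice changes once the prepositional-phrase noun is masked. The heuristic, the function name, and the hard-coded `keep` mask are assumptions for illustration; they are not a claim about what a trained Transformer computes internally.

```python
def nearest_noun_before(is_noun, verb_index, keep=None):
    """Index of the closest noun left of the verb, skipping masked positions."""
    for i in range(verb_index - 1, -1, -1):
        if is_noun[i] and (keep is None or keep[i]):
            return i
    return None

tokens = ["The", "cake", "on", "the", "plate", "burned"]
is_noun = [False, True, False, False, True, False]
keep = [1, 1, 1, 1, 0, 1]  # 0 marks the prepositional-phrase noun "plate"
verb = 5

unmasked = nearest_noun_before(is_noun, verb)
masked = nearest_noun_before(is_noun, verb, keep)
print(tokens[unmasked], tokens[masked])
# plate cake
```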

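COGS: 24,155 training examples, 3,000 test examples, 21,000 generalization examples
ReCOGS_pos: a version of ReCOGS that uses positional indices; it is semantically equivalent to COGS but can be scored with semantic exact match
Grammar coverage: the method of Zeller et al. (2023) is used to verify that the 19 rules cover 100% of the non-recursive grammar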

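Evaluation Metrics

String exact match: the predicted logical form string is identical to the reference
Semantic exact match: the predicted logical form is semantically equivalent to the reference, although indices and conjunct order may differ
Grammar coverage: the fraction of the grammar's expansions that the model supports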
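
The difference between the first two metrics can be sketched in Python as follows. This is a minimal illustration, not the official ReCOGS evaluation script: it assumes the compact LF notation used in this summary, treats every purely numeric argument as a renameable index, and tries all index renamings by brute force, which is only practical for short logical forms. The function names are hypothetical.

```python
import re
from itertools import permutations

def parse_lf(lf):
    """Split an LF string into a set of (predicate, args) conjuncts."""
    conjuncts = set()
    for part in re.split(r";|\bAND\b", lf):
        part = part.strip()
        if not part:
            continue
        m = re.fullmatch(r"(.+?)\s*\(\s*(.*?)\s*\)", part)
        if m is None:
            raise ValueError(f"cannot parse conjunct: {part!r}")
        args = tuple(a.strip() for a in m.group(2).split(","))
        conjuncts.add((m.group(1).strip(), args))
    return conjuncts

def semantic_exact_match(predicted, gold):
    """True if the LFs agree up to conjunct order and a consistent index renaming."""
    a, b = parse_lf(predicted), parse_lf(gold)
    idx_a = sorted({x for _, args in a for x in args if x.isdigit()})
    idx_b = sorted({x for _, args in b for x in args if x.isdigit()})
    if len(a) != len(b) or len(idx_a) != len(idx_b):
        return False
    for perm in permutations(idx_b):
        rename = dict(zip(idx_a, perm))
        renamed = {(p, tuple(rename.get(x, x) for x in args)) for p, args in a}
        if renamed == b:
            return True
    return False

gold = "scientist(1); cat(4); donut(6); lend(2) AND agent(2,1) AND recipient(2,4) AND theme(2,6)"
pred = "cat(3); scientist(0); donut(5); lend(1) AND theme(1,5) AND agent(1,0) AND recipient(1,3)"
print(pred == gold, semantic_exact_match(pred, gold))
# False True
```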

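Comparison Methods

Wu et al. (2024) baseline: a 2-layer Encoder-Decoder Transformer
Layer-count variants: 3-layer and 4-layer versions
Data-augmented version: adds examples with specific prepositional-phrase modification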

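Implementation Details

Programs are evaluated with the official RASP interpreter
The vocabulary map is built from all words in the COGS training set
Because the programs are deterministic, results are reported with Clopper-Pearson confidence intervals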
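
For readers unfamiliar with the Clopper-Pearson (exact binomial) interval, a minimal standard-library Python sketch follows. It finds the bounds by bisection on the binomial CDF rather than through a statistics library; the function names and the 95% default level are assumptions for this example, and the paper's own computation may differ in implementation.

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summed in log space to avoid overflow."""
    if k < 0:
        return 0.0
    if k >= n or p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 0.0
    total = 0.0
    for i in range(k + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                    + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_term)
    return min(total, 1.0)

def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for k successes out of n trials."""
    def solve(j, target):
        # binom_cdf(j, n, p) decreases as p grows, so bisect for cdf == target.
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            if binom_cdf(j, n, mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2
    lower = 0.0 if k == 0 else solve(k - 1, 1 - alpha / 2)
    upper = 1.0 if k == n else solve(k, alpha / 2)
    return lower, upper

lower, upper = clopper_pearson(1000, 1000)
print(round(lower * 100, 2), upper)
# 99.63 1.0
```

When all n examples are correct, the lower bound has the closed form (alpha/2)^(1/n); assuming 1,000 examples per generalization split, this gives about 99.63% for a split scored at 100.00%.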

COGS (string exact match)

Test set: 99.97% (99.81-99.99%)
obj_pp_to_subj_pp: 100.00% (99.63-100.00%)
pp_recursion: 98.40% (97.41-99.08%)
cp_recursion: 99.90% (99.44-99.997%)
Overall generalization: 99.89% (99.83-99.93%)

ReCOGS_pos (semantic exact match)

Test set: 100.00% (99.88-100.00%)
obj_pp_to_subj_pp: 92.20% (90.36-93.79%)
pp_recursion: 100.00% (99.63-100.00%)
cp_recursion: 100.00% (99.63-100.00%)
Overall generalization: 99.63% (99.54-99.71%)

Baseline Transformer Performance Comparison

Wu et al. (2024) baseline (ReCOGS_pos)

pp_recursion: 40.2% ± 9.3%
cp_recursion: 52.4% ± 1.4%
obj_pp_to_subj_pp: 19.7% ± 6.1%

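Attraction Error Analysis

Error analysis of the baseline Transformer confirms the theoretical predictions:

96.73% of single-relation errors match the attraction-error pattern
100% of errors on depth-2 prepositional phrases attach to the nearest prepositional-phrase noun
This confirms the hypothesis of non-hierarchical, linear processing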

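Validation of the New Generalization Split

The "v_dat_p2_pp_moved_to_recipient" split:

Baseline performance: 13% ± 15.6% (comparable to the hardest splits)
Supports the flat-processing hypothesis over the tree-structure hypothesis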

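Theoretical feasibility: A Transformer can in principle solve the COGS task through flat pattern matching, without hierarchical representations
Key mechanism: Masking prepositional-phrase nouns is the key to avoiding attraction errors
Learning problem: The failure of current Transformers is a problem of learning, not a limit of capacity
Predictable errors: The flat-processing hypothesis accurately predicts the specific errors of the baseline models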

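Limitations

Hand-built: The RASP models are designed by hand, not learned
Vocabulary restriction: The part-of-speech and verb-type maps are assumed to be known, so lexical generalization is not addressed
Language-specific: Covers English only; applicability to other languages is unknown
Task-specific: The models are designed specifically for COGS and are not general-purpose language models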

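Future Directions

Learning algorithms: Study how a Transformer could learn similar masking rules
Training objectives: Explore data augmentation, curriculum learning, reinforcement learning, and related methods
Architecture improvements: Design better inductive biases that promote compositional generalization
Multilingual extension: Test whether the approach works for other languages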

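In-Depth Assessment

Strengths

Theoretical contribution: The constructive proof clarifies what Transformers can do in principle
Methodological novelty: The flat solution challenges the assumption that hierarchical representations are necessary
Empirical rigor: Detailed error analysis and verified predictions make the conclusions more credible
Engineering completeness: Provides complete, reproducible code and detailed implementation documentation
Insight: The attraction-error theory offers a new perspective on why Transformers fail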

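Weaknesses

Limited practicality: The RASP models run extremely slowly and are suitable for research only, not for practical applications
No learning result: The core question of how a Transformer could learn these rules automatically remains unsolved
Evaluation scope: The focus is on structural generalization, with little attention to lexical generalization
Strong assumption: Assuming a known part-of-speech map may be unrealistic in practical applications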

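Impact

Theoretical impact: Offers a new theoretical framework and analysis tools for research on compositional generalization
Methodological impact: RASP-based analysis may be applied to studies of other Transformer capabilities
Practical guidance: Points to concrete technical directions for improving Transformer training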

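Applicable Scenarios

Research tool: A theoretical tool for analyzing Transformer capabilities
Benchmarking: A reference implementation for evaluating compositional generalization
Teaching resource: Helps explain the internal workings of Transformers
Algorithm design: Offers ideas for designing better algorithms for compositional generalization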

References

Kim, N., & Linzen, T. (2020). COGS: A compositional generalization challenge based on semantic interpretation. EMNLP 2020.
Wu, Z., Manning, C. D., & Potts, C. (2024). ReCOGS: How incidental details of a logical form overshadow an evaluation of semantic interpretation. TACL.
Weiss, G., Goldberg, Y., & Yahav, E. (2021). Thinking like transformers. NeurIPS 2021.
Zhou, H., et al. (2023). What algorithms can transformers learn? A study in length generalization. arXiv preprint.
Zeller, A., et al. (2023). Grammar coverage. In The Fuzzing Book.

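Through careful theoretical analysis and empirical validation, this paper offers important insight into the capabilities and limitations of Transformers on compositional generalization tasks. Despite some practical limitations, its theoretical contribution and methodological novelty are valuable for advancing research in this area.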