
Diff-XYZ: A Benchmark for Evaluating Diff Understanding

Glukhov, Conti, Bogomolov et al.

Reliable handling of code diffs is central to agents that edit and refactor repositories at scale. We introduce Diff-XYZ, a compact benchmark for code-diff understanding with three supervised tasks: apply (old code $+$ diff $\rightarrow$ new code), anti-apply (new code $-$ diff $\rightarrow$ old code), and diff generation (new code $-$ old code $\rightarrow$ diff). Instances in the benchmark are triples $\langle \textit{old code}, \textit{new code}, \textit{diff} \rangle$ drawn from real commits in CommitPackFT, paired with automatic metrics and a clear evaluation protocol. We use the benchmark to conduct a focused empirical study of the unified diff format and to run a cross-format comparison of different diff representations. Our findings show that the best format depends on the use case and on model size. For example, representing diffs in search-replace format works well for larger models in the diff generation scenario, but is not well suited to diff analysis or to smaller models. The Diff-XYZ benchmark is a reusable foundation for assessing and improving diff handling in LLMs, and it can aid future development of diff formats and of models that edit code. The dataset is published on HuggingFace Hub: https://huggingface.co/datasets/JetBrains-Research/diff-xyz.



Basic Information

Paper ID: 2510.12487
Title: Diff-XYZ: A Benchmark for Evaluating Diff Understanding
Authors: Evgeniy Glukhov, Michele Conti, Egor Bogomolov, Yaroslav Golubev, Alexander Bezzubov (JetBrains Research)
Categories: cs.SE (Software Engineering), cs.LG (Machine Learning)
Venue: 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Deep Learning for Code in the Agentic Era
Paper link: https://arxiv.org/abs/2510.12487

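No systematic study of format choice: several diff representations exist (unified diff, search-replace, and others), but they have not been compared systematically.
Evaluation complexity: existing end-to-end benchmarks (such as SWE-bench) mix several factors (retrieval, tool use, and so on), which makes it hard to isolate the effect of the diff format.
Diverse failure modes: a patch can fail because of syntax errors, context mismatches, or logic errors, so a finer-grained analysis is needed.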

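Research Significance

Practical value: handling code diffs is a core capability for automated code editing, bug fixing, CI build repair, and similar tasks.
Theoretical significance: understanding how LLMs process structured edit information matters for improving code generation models.
Engineering value: the study provides data-driven guidance for choosing a suitable diff format.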

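Core Contributions

The Diff-XYZ benchmark: a lightweight, reproducible evaluation framework built around three complementary tasks.
Systematic format comparison: the first controlled, systematic comparison of multiple diff representations.
Performance baselines: detailed baselines for proprietary and open-source models on diff understanding tasks.
Guidance on format choice: the relationship between model size, task type, and the best-performing format.
Open dataset: a high-quality evaluation dataset released on HuggingFace Hub.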

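Method Details

Task Definition

Based on the equation diff = new code - old code, three tasks are defined:

X. Apply Task

Input: old code + diff
Output: new code
Goal: test format adherence and character-level fidelity

Y. Anti-Apply Task

Input: new code + diff
Output: old code
Goal: probe whether the format is reversible and lossless

Z. Diff Generation Task

Input: old code + new code
Output: diff
Goal: test whether the model can reliably synthesize a diff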
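
As a rough illustration of how a single triple yields all three tasks, the sketch below builds the inputs and targets from one old/new pair. It uses Python's standard `difflib.unified_diff` to produce the diff. The function name `make_tasks`, the dictionary layout, and the `a/` and `b/` path prefixes are assumptions made for illustration, not the benchmark's actual data schema.

```python
import difflib


def make_tasks(old_code: str, new_code: str, path: str = "example.py") -> dict:
    """Build hypothetical X/Y/Z task instances from one (old, new) pair."""
    # The diff is the third element of the <old code, new code, diff> triple.
    diff = "".join(
        difflib.unified_diff(
            old_code.splitlines(keepends=True),
            new_code.splitlines(keepends=True),
            fromfile=f"a/{path}",
            tofile=f"b/{path}",
        )
    )
    return {
        # X: old code + diff -> new code
        "apply": {"inputs": (old_code, diff), "target": new_code},
        # Y: new code + diff -> old code
        "anti_apply": {"inputs": (new_code, diff), "target": old_code},
        # Z: old code + new code -> diff
        "diff_generation": {"inputs": (old_code, new_code), "target": diff},
    }


old = "def add(a, b):\n    return a - b\n"
new = "def add(a, b):\n    return a + b\n"
tasks = make_tasks(old, new)
print(tasks["diff_generation"]["target"])
```

Each task hides a different element of the triple, which is why a model that handles one direction well can still fail on another.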

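Dataset Construction

Data source: real open-source commits from the CommitPackFT dataset

Filtering strategy:

Keep only commits that modify a single file
Exclude binary files, generated code, and vendor directories
Restrict file length to 40-1000 lines
Exclude whitespace-only changes

Stratified sampling:

Language distribution: 200 samples each for Python, JavaScript, Java, Kotlin, and Rust (1,000 in total)
Edit complexity: stratified by number of hunks and by change size
- Small edits: ≤7 changed lines (40%)
- Medium edits: 8-24 changed lines (40%)
- Large edits: >24 changed lines (20%)
Change types: 81.5% contain both additions and deletions, 16.3% are additions only, 2.2% are deletions only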
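
The length filter and the size buckets above can be sketched in a few lines. This is a minimal sketch under stated assumptions: the helper names are hypothetical, "changed lines" is taken to mean added plus deleted lines in a unified diff, and the summary does not say whether the 40-1000 line limit applies to the old file, the new file, or both.

```python
def count_changed_lines(unified_diff: str) -> int:
    """Count added plus deleted lines, skipping the ---/+++ file headers."""
    count = 0
    for line in unified_diff.splitlines():
        # Simplification: a changed line whose own content starts with
        # "--" or "++" would also be skipped by this header check.
        if line.startswith(("---", "+++")):
            continue
        if line.startswith(("+", "-")):
            count += 1
    return count


def edit_size_bucket(changed_lines: int) -> str:
    """Map a changed-line count to the small/medium/large strata."""
    if changed_lines <= 7:
        return "small"
    if changed_lines <= 24:
        return "medium"
    return "large"


def passes_length_filter(code: str, lo: int = 40, hi: int = 1000) -> bool:
    """Keep files whose line count lies in [lo, hi] (assumed inclusive)."""
    return lo <= len(code.splitlines()) <= hi


sample_diff = (
    "--- a/f.py\n+++ b/f.py\n@@ -1,2 +1,2 @@\n"
    " def add(a, b):\n-    return a - b\n+    return a + b\n"
)
print(edit_size_bucket(count_changed_lines(sample_diff)))
```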

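Evaluation Metrics

Apply and Anti-Apply tasks:

Stripped Exact Match (EM): exact-match rate after blank lines are removed
Stripped Intersection over Union (IoU): intersection over union computed at the line level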
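
One plausible reading of these two metrics is sketched below. The details are assumptions rather than the benchmark's reference implementation: blank lines are taken to be whitespace-only lines, other lines are compared verbatim, and IoU is computed over sets of unique lines.

```python
def stripped_lines(code: str) -> list[str]:
    """Drop whitespace-only lines; keep the remaining lines unchanged."""
    return [line for line in code.splitlines() if line.strip()]


def stripped_em(pred: str, ref: str) -> float:
    """1.0 if the non-blank lines match exactly and in order, else 0.0."""
    return float(stripped_lines(pred) == stripped_lines(ref))


def stripped_iou(pred: str, ref: str) -> float:
    """Line-level IoU over the sets of unique non-blank lines."""
    p, r = set(stripped_lines(pred)), set(stripped_lines(ref))
    if not p and not r:
        return 1.0  # two empty outputs are treated as identical
    return len(p & r) / len(p | r)


ref = "a = 1\nb = 2\n"
print(stripped_em("a = 1\n\nb = 2\n", ref))   # extra blank line is ignored
print(stripped_iou("a = 1\nb = 3\n", ref))    # one shared line out of three
```

EM is all-or-nothing, while IoU gives partial credit when most lines are right, so reporting both separates near misses from outright failures.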

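Diff Generation task:

Parsing Rate: share of generated diffs that can be parsed
Applying Rate: share of generated diffs that apply successfully
EM/IoU after application: exact-match rate and IoU of the code obtained by applying the diff
F1+ / F1-: F1 scores over added lines and over deleted lines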
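
The F1+ and F1- scores can be approximated as follows. This is a sketch under assumptions: both diffs are in unified format, added and deleted lines are compared as sets of line contents, and two empty sets count as a perfect match. The helper names are hypothetical.

```python
def changed_line_sets(diff_text: str) -> tuple[set, set]:
    """Return (added, deleted) line contents from a unified diff."""
    added, deleted = set(), set()
    for line in diff_text.splitlines():
        # Skip file headers; this also skips changed lines whose own
        # content starts with "--" or "++", which is a simplification.
        if line.startswith(("---", "+++")):
            continue
        if line.startswith("+"):
            added.add(line[1:])
        elif line.startswith("-"):
            deleted.add(line[1:])
    return added, deleted


def f1(pred: set, ref: set) -> float:
    """Set-based F1; defined as 1.0 when both sets are empty."""
    if not pred and not ref:
        return 1.0
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


def f1_plus_minus(pred_diff: str, ref_diff: str) -> tuple[float, float]:
    """Return (F1+, F1-) for a predicted diff against a reference diff."""
    pred_added, pred_deleted = changed_line_sets(pred_diff)
    ref_added, ref_deleted = changed_line_sets(ref_diff)
    return f1(pred_added, ref_added), f1(pred_deleted, ref_deleted)


ref_diff = "@@ -1,2 +1,2 @@\n a\n-b\n+c\n"
pred_diff = "@@ -1,2 +1,3 @@\n a\n-b\n+c\n+d\n"
f1_plus, f1_minus = f1_plus_minus(pred_diff, ref_diff)
print(f1_plus, f1_minus)
```

Unlike EM and IoU after application, these scores can be computed even when the generated diff does not apply cleanly, as long as its changed lines can be read.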

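Technical Innovations

Complementary task design: the three tasks assess diff understanding from different angles.
Controlled experiments: the context is held fixed and only the format varies, so the effect of the format can be measured precisely.
Grounded in real data: instances come from real commits rather than synthetic data, which supports ecological validity.
Multi-dimensional evaluation: the metrics combine syntactic validity, application success rate, and semantic correctness.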

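Experimental Setup

Compared Formats

udiff: the standard unified diff format
udiff-h: unified diff with relaxed hunk headers
udiff-l: unified diff with explicit line markers (ADD/DEL/CON)
search-replace: the search-and-replace format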
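
To make the explicit-marker idea concrete, the sketch below relabels the `+`, `-`, and space prefixes of a unified diff with ADD, DEL, and CON. The summary names the markers but not the exact syntax, so the layout used here (marker, one space, then the line content, with headers left untouched) is an assumption, and `relabel_hunk_lines` is a hypothetical helper.

```python
# Assumed mapping from unified diff prefixes to explicit markers.
MARKERS = {"+": "ADD", "-": "DEL", " ": "CON"}


def relabel_hunk_lines(udiff: str) -> str:
    """Replace single-character line prefixes with explicit word markers."""
    out = []
    for line in udiff.splitlines():
        if not line or line.startswith(("---", "+++", "@@")):
            out.append(line)  # file and hunk headers are kept as they are
            continue
        marker = MARKERS.get(line[0])
        out.append(f"{marker} {line[1:]}" if marker else line)
    return "\n".join(out)


udiff = (
    "--- a/f.py\n+++ b/f.py\n@@ -1,2 +1,2 @@\n"
    " def add(a, b):\n-    return a - b\n+    return a + b"
)
print(relabel_hunk_lines(udiff))
```

In the plain format the deleted line `-    return a - b` uses `-` both as the diff marker and as a code operator; a word marker removes that overlap.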

Evaluated Models

Proprietary models:

GPT-4o, GPT-4o-mini
GPT-4.1, GPT-4.1-mini, GPT-4.1-nano
Claude 4 Sonnet
Gemini 2.5 Flash

Open-source models:

Qwen2.5-Coder family (0.5B-32B)

Prompting Strategies

w/o format: a generic assistant prompt
w/ format: a system prompt that includes a description of the diff format

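Experimental Results

Main Results

Proprietary models:

Claude 4 Sonnet performs best on the Apply task (EM: 0.95-0.96)
GPT-4.1 performs strongly on all tasks but is sensitive to the prompt
Smaller proprietary models (such as GPT-4.1-nano) degrade markedly on the more complex tasks

Scaling trends for open-source models:

Performance improves consistently with model size
Qwen2.5-Coder-32B approaches GPT-4o on Apply/Anti-Apply
A substantial gap remains on Diff Generation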

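Format Comparison Findings

Key findings:

Task dependence:
- Apply/Anti-Apply: the udiff format performs best
- Diff Generation: search-replace is better for large models
Model size effects:
- Large models: search-replace stands out on the generation task
- Small models: udiff-l (explicit markers) works best
Format characteristics:
- search-replace strength: it avoids global constraints, and each local edit is independent
- udiff-h weakness: removing the line-number scaffolding leads to structural confusion
- udiff-l strength: explicit markers reduce marker collisions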
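
The independence of local edits in search-replace can be seen in a minimal applier. This is a sketch under assumptions: edits are represented as hypothetical `(search, replace)` string pairs rather than in any specific textual block syntax, and each search string is required to match exactly once.

```python
def apply_search_replace(code: str, edits: list[tuple[str, str]]) -> str:
    """Apply (search, replace) pairs in order; each must match exactly once."""
    for search, replace in edits:
        occurrences = code.count(search)
        if occurrences != 1:
            raise ValueError(
                f"search text must match exactly once, found {occurrences}"
            )
        code = code.replace(search, replace, 1)
    return code


old = "def add(a, b):\n    return a - b\n"
edits = [("return a - b", "return a + b")]
print(apply_search_replace(old, edits))
```

No edit carries line numbers or hunk lengths, so a mistake in one edit does not shift or invalidate the others, whereas a wrong count in a unified diff hunk header can make the whole patch fail to apply.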