Diff-XYZ: A Benchmark for Evaluating Diff Understanding

Glukhov, Conti, Bogomolov et al.

Reliable handling of code diffs is central to agents that edit and refactor repositories at scale. We introduce Diff-XYZ, a compact benchmark for code-diff understanding with three supervised tasks: apply (old code $+$ diff $\rightarrow$ new code), anti-apply (new code $-$ diff $\rightarrow$ old code), and diff generation (new code $-$ old code $\rightarrow$ diff). Instances in the benchmark are triples $\langle \textit{old code}, \textit{new code}, \textit{diff} \rangle$ drawn from real commits in CommitPackFT, paired with automatic metrics and a clear evaluation protocol. We use the benchmark to conduct a focused empirical study of the unified diff format and a cross-format comparison of different diff representations. Our findings reveal that different formats should be used depending on the use case and model size. For example, representing diffs in search-replace format works well for larger models in the diff generation scenario, yet is poorly suited to diff analysis and to smaller models. The Diff-XYZ benchmark is a reusable foundation for assessing and improving diff handling in LLMs, and it can aid future development of diff formats and of models that edit code. The dataset is published on HuggingFace Hub: https://huggingface.co/datasets/JetBrains-Research/diff-xyz.

Basic Information

Paper ID: 2510.12487
Title: Diff-XYZ: A Benchmark for Evaluating Diff Understanding
Authors: Evgeniy Glukhov, Michele Conti, Egor Bogomolov, Yaroslav Golubev, Alexander Bezzubov (JetBrains Research)
Classification: cs.SE (Software Engineering), cs.LG (Machine Learning)
Conference: 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Deep Learning for Code in the Agentic Era
Paper Link: https://arxiv.org/abs/2510.12487

Abstract

This paper proposes the Diff-XYZ benchmark for evaluating large language models' understanding of code diffs. The benchmark comprises three supervised tasks: apply (old code + diff → new code), anti-apply (new code - diff → old code), and diff generation (new code - old code → diff). The data derives from real commits in CommitPackFT and contains 1,000 ⟨old code, new code, diff⟩ triples. The study finds that the best diff format depends on the use case and model size: for example, search-replace works well for larger models generating diffs but is poorly suited to diff analysis and to smaller models.

Research Background and Motivation

Core Problems

Modern code editing agents require reliable handling of code diffs in large-scale repository editing and refactoring, yet existing evaluation methods face the following issues:

Lack of systematic format selection research: While multiple diff representation formats exist (unified diff, search-replace, etc.), systematic format comparison studies are lacking
Evaluation complexity: Existing end-to-end benchmarks (e.g., SWE-bench) conflate multiple factors (retrieval, tool usage, etc.), making it difficult to isolate the impact of diff formats
Diverse failure modes: Patches may fail due to syntax errors, context mismatches, or logical errors, requiring more fine-grained analysis

Research Significance

Practical value: Code diff processing is a core capability for automated code editing, bug fixing, CI build repair, and other tasks
Theoretical significance: Understanding how LLMs process structured editing information is crucial for improving code generation models
Engineering value: Provides data-driven guidance for selecting appropriate diff formats

Core Contributions

Proposes Diff-XYZ benchmark: A lightweight, reproducible evaluation framework with three complementary tasks
Systematic format comparison: First controlled-variable comparison of multiple diff representation formats
Establishes performance baselines: Detailed performance baselines for proprietary and open-source models on diff understanding tasks
Format selection guidance: Reveals relationships between model size, task type, and optimal format choices
Open dataset: Publishes high-quality evaluation dataset on HuggingFace Hub

Methodology Details

Task Definition

Based on the equation diff = new code - old code, three tasks are defined, each predicting one element of the triple from the other two:

X. Apply Task

Input: Old code + diff
Output: New code
Objective: Tests format adherence and character-level fidelity

Y. Anti-Apply Task

Input: New code + diff
Output: Old code
Objective: Probes format reversibility and losslessness

Z. Diff Generation Task

Input: Old code + new code
Output: diff
Objective: Tests reliable diff synthesis capability
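
The relationship between the three tasks can be sketched as a round trip over one triple. The following is a minimal illustration, not the benchmark's own tooling: `make_diff` and `patch` are hypothetical helper names, the reference diff is produced with Python's standard `difflib`, and the applier assumes a well-formed single-file unified diff and ignores trailing-newline differences.

```python
import difflib
import re

# Matches unified diff hunk headers such as "@@ -1,3 +1,4 @@" or "@@ -5 +5 @@".
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def make_diff(old_code: str, new_code: str) -> str:
    """Task Z reference: a unified diff from old code to new code."""
    lines = difflib.unified_diff(
        old_code.splitlines(), new_code.splitlines(),
        fromfile="a/file", tofile="b/file", lineterm="",
    )
    return "\n".join(lines)


def patch(code: str, diff_text: str, reverse: bool = False) -> str:
    """Task X (reverse=False) or Task Y (reverse=True) for a single-file diff."""
    # In reverse mode the roles of "+" and "-" lines are swapped.
    emit, consume = ("-", "+") if reverse else ("+", "-")
    start_group, count_group = (3, 4) if reverse else (1, 2)
    src = code.splitlines()
    out, pos, in_hunk = [], 0, False
    for line in diff_text.splitlines():
        header = HUNK_RE.match(line)
        if header:
            start = int(header.group(start_group))
            count = header.group(count_group)
            count = 1 if count is None else int(count)
            # Starts are 1-based; an empty range names the line before the hunk.
            hunk_pos = start - 1 if count > 0 else start
            if hunk_pos < pos:
                raise ValueError("hunks overlap or are out of order")
            out.extend(src[pos:hunk_pos])  # copy untouched lines before the hunk
            pos, in_hunk = hunk_pos, True
        elif not in_hunk or line.startswith("\\"):
            continue  # file headers and "\ No newline at end of file"
        elif line.startswith(emit):
            out.append(line[1:])
        elif line.startswith((consume, " ")):
            if pos >= len(src) or src[pos] != line[1:]:
                raise ValueError(f"context mismatch at source line {pos + 1}")
            if line[0] == " ":
                out.append(line[1:])
            pos += 1
        else:
            raise ValueError(f"unexpected diff line: {line!r}")
    out.extend(src[pos:])  # copy the remainder after the last hunk
    return "\n".join(out)


old = "a\nb\nc"
new = "a\nx\nc\nd"
diff = make_diff(old, new)            # Z: old + new -> diff
assert patch(old, diff) == new        # X: old + diff -> new
assert patch(new, diff, True) == old  # Y: new - diff -> old
```

In the benchmark the model plays the role of `patch` or `make_diff`; a deterministic tool like this only shows what a correct answer looks like for each task.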

Dataset Construction

Data Source: Real open-source commits from the CommitPackFT dataset

Filtering Strategy:

Retains only single-file modification commits
Excludes binary files, generated code, and vendor directories
File line count constraint: 40-1,000 lines
Excludes whitespace-only changes
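
As a rough illustration of how these filters compose, the sketch below applies them to a hypothetical `Commit` record. The record layout, the precomputed boolean flags, and the choice to apply the line-count bound to both the old and the new file are assumptions made for this example; the summary does not specify them.

```python
from dataclasses import dataclass


@dataclass
class Commit:
    """Hypothetical commit record; field names are illustrative only."""
    changed_files: list
    old_code: str
    new_code: str
    is_binary: bool = False
    is_generated_or_vendored: bool = False


def is_whitespace_only_change(old_code: str, new_code: str) -> bool:
    # Two versions that are equal once all whitespace is removed differ only in whitespace.
    return "".join(old_code.split()) == "".join(new_code.split())


def keep_commit(commit: Commit, min_lines: int = 40, max_lines: int = 1000) -> bool:
    if len(commit.changed_files) != 1:
        return False  # single-file commits only
    if commit.is_binary or commit.is_generated_or_vendored:
        return False
    for code in (commit.old_code, commit.new_code):
        # Assumption: the 40-1,000 line bound applies to both versions of the file.
        if not min_lines <= len(code.splitlines()) <= max_lines:
            return False
    return not is_whitespace_only_change(commit.old_code, commit.new_code)
```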

Stratified Sampling:

Language distribution: Python, JavaScript, Java, Kotlin, Rust with 200 samples each
Edit complexity: Stratified by change block count and change size
- Small edits: ≤7 line changes (40%)
- Medium edits: 8-24 line changes (40%)
- Large edits: >24 line changes (20%)
Change types: 81.5% contain both additions and deletions, 16.3% additions only, 2.2% deletions only
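
The size buckets above can be expressed as a small helper. This is a sketch under one assumption: that "line changes" means added plus deleted lines, counted here with `difflib.SequenceMatcher`; the function names are illustrative, and the summary does not say how the benchmark itself counts changes.

```python
import difflib


def count_changed_lines(old_code: str, new_code: str) -> int:
    """Added plus deleted lines between two versions (assumed definition)."""
    old_lines, new_lines = old_code.splitlines(), new_code.splitlines()
    matcher = difflib.SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    return sum(
        (i2 - i1) + (j2 - j1)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    )


def edit_size_bucket(changed_lines: int) -> str:
    """Map a change count to the small / medium / large strata listed above."""
    if changed_lines <= 7:
        return "small"
    if changed_lines <= 24:
        return "medium"
    return "large"
```

If the stated percentages apply to all 1,000 instances, they correspond to roughly 400 small, 400 medium, and 200 large edits, with 200 instances per language.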

Evaluation Metrics

Apply and Anti-Apply Tasks:

Stripped Exact Match (EM): Exact match rate after removing blank lines
Stripped Intersection over Union (IoU): Line-level intersection over union
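
A minimal sketch of the two metrics follows. It assumes that "stripped" means dropping whitespace-only lines before comparison and that IoU is taken over the sets of unique remaining lines; the benchmark's exact definitions may differ in these details.

```python
def stripped_lines(code: str) -> list:
    """Lines of the code with whitespace-only lines removed."""
    return [line for line in code.splitlines() if line.strip()]


def stripped_em(prediction: str, reference: str) -> float:
    """1.0 if the non-blank lines match exactly and in order, else 0.0."""
    return float(stripped_lines(prediction) == stripped_lines(reference))


def stripped_iou(prediction: str, reference: str) -> float:
    """Intersection over union of the sets of unique non-blank lines."""
    pred, ref = set(stripped_lines(prediction)), set(stripped_lines(reference))
    if not pred and not ref:
        return 1.0  # two empty outputs are treated as identical
    return len(pred & ref) / len(pred | ref)
```

For example, a prediction `"a\nb\n"` against a reference `"a\nc\n"` shares one of three distinct lines, so its IoU is 1/3 while its EM is 0.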

Diff Generation Task:

Parsing Rate: Proportion of parseable diffs
Applying Rate: Proportion of successfully applicable diffs
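EM/IoU after application: Exact match rate and IoU between the reference new code and the result of applying the generated diff to the old code
F1+ / F1-: F1 scores computed over added lines (F1+) and deleted lines (F1-)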
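
One way F1+ and F1- could be computed is sketched below, assuming set-based matching of the lines a single-file unified diff adds or deletes. The helper names are hypothetical, and the benchmark may handle duplicates or other formats differently.

```python
def changed_lines(diff_text: str, marker: str) -> set:
    """Lines added ("+") or deleted ("-") inside the hunks of a single-file unified diff."""
    lines, in_hunk = set(), False
    for line in diff_text.splitlines():
        if line.startswith("@@"):
            in_hunk = True  # skip the ---/+++ file headers that precede the first hunk
        elif in_hunk and line.startswith(marker):
            lines.add(line[1:])
    return lines


def f1(predicted: set, reference: set) -> float:
    if not predicted and not reference:
        return 1.0  # nothing to add or delete, and nothing was
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


def f1_plus_minus(predicted_diff: str, reference_diff: str) -> tuple:
    """Return (F1+, F1-) for a predicted diff against a reference diff."""
    return tuple(
        f1(changed_lines(predicted_diff, m), changed_lines(reference_diff, m))
        for m in ("+", "-")
    )
```

Unlike EM after application, these scores still give partial credit when a diff fails to apply cleanly but contains the right added or deleted lines.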

Technical Innovations

Complementary task design: Three tasks comprehensively evaluate diff understanding from different angles
Controlled variable experiments: Isolates the impact of the format by fixing the task context and varying only the diff representation
Real data-driven approach: Based on real commits rather than synthetic data, ensuring ecological validity
Multi-dimensional evaluation: Combines syntactic correctness, application success rate, and semantic correctness

Experimental Setup

Compared Formats

udiff: Standard unified diff format
udiff-h: Unified diff with relaxed hunk headers (line numbers omitted)
udiff-l: Unified diff with explicit markers (ADD/DEL/CON)
search-replace: Search-replace format
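
To make the differences between these representations concrete, the sketch below rewrites unified diff hunk lines with explicit tags and applies a list of search-replace edits. Both the concrete udiff-l syntax (a tag, a space, then the line) and the search-replace representation (ordered pairs whose search text must occur exactly once) are assumptions for illustration; the paper's exact formats are not reproduced here.

```python
# Assumed mapping from unified diff markers to explicit tags.
TAGS = {"+": "ADD", "-": "DEL", " ": "CON"}


def to_udiff_l(udiff: str) -> str:
    """Replace the one-character +/-/space markers inside hunks with explicit tags."""
    out, in_hunk = [], False
    for line in udiff.splitlines():
        if line.startswith("@@"):
            in_hunk = True
            out.append(line)
        elif in_hunk and line[:1] in TAGS:
            out.append(f"{TAGS[line[0]]} {line[1:]}")
        else:
            out.append(line)  # file headers and anything unrecognized pass through
    return "\n".join(out)


def apply_search_replace(code: str, edits: list) -> str:
    """Apply (search, replace) pairs in order; each search text must match exactly once."""
    for search, replace in edits:
        if code.count(search) != 1:
            raise ValueError(f"search text must occur exactly once: {search!r}")
        code = code.replace(search, replace, 1)
    return code
```

Each search-replace edit is located by its content rather than by line numbers, which is one reason such a format avoids the global bookkeeping that unified diff hunk headers require.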

Test Models

Proprietary Models:

GPT-4o, GPT-4o-mini
GPT-4.1, GPT-4.1-mini, GPT-4.1-nano
Claude 4 Sonnet
Gemini 2.5 Flash

Open-Source Models:

Qwen2.5-Coder series (0.5B-32B)

Prompting Strategies

w/o format: Generic assistant prompt
w/ format: System prompt with format description

Experimental Results

Main Results

Proprietary Model Performance:

Claude 4 Sonnet achieves best performance on Apply task (EM: 0.95-0.96)
GPT-4.1 demonstrates strong performance across all tasks but is sensitive to prompts
Smaller proprietary models (e.g., GPT-4.1-nano) show significant degradation on complex tasks

Open-Source Model Scaling Trends:

Performance clearly improves with model scale
Qwen2.5-Coder-32B approaches GPT-4o level on Apply/Anti-Apply tasks
Significant gap remains on Diff Generation

Format Comparison Findings

Key Discoveries:

Task Dependency:
- Apply/Anti-Apply: udiff format performs best
- Diff Generation: search-replace superior for large models
Model Scale Effects:
- Large models: search-replace excels in generation tasks
- Small models: udiff-l (explicit markers) performs best
Format Characteristic Analysis:
- search-replace advantages: Avoids global constraints, independent local edits
- udiff-h disadvantages: Removing line number scaffolding causes structural confusion
- udiff-l advantages: Explicit ADD/DEL/CON markers reduce marker conflicts

Ablation Study Results

Prompt Impact:

Diff Generation task highly sensitive to format descriptions
Without a format description, GPT-4.1 tends to output the V4A patch format
Apply task relatively robust to prompt variations

Language Differences:

Relatively consistent performance across five programming languages
Slightly better performance on Python and JavaScript

Related Work

Code Generation Benchmarks

HumanEval/MBPP: Function-level code generation
BigCodeBench: Complex library call tasks
Limitations: Primarily focus on generation from scratch, not editing representations

Editing and Problem-Solving Benchmarks

SWE-bench: Real GitHub issue resolution
CodeEditorBench: Instruction-following editing
Limitations: End-to-end evaluation, difficult to isolate format impact

This Paper's Position

Complementarity: Focuses on isolated study of editing representations
Lightweight: No repository setup or execution environment required
Controllability: Fixes task context, varies representation formats

Conclusions and Discussion

Main Conclusions

Format selection is critical: Different formats show significant performance variations across tasks and model sizes
Task specificity: Generation and application tasks require different optimal formats
Scale dependency: Optimal strategies differ for small and large models
Reality gap: Open-source models still have substantial room for improvement in diff generation

Limitations

Task simplification: Benchmark tasks are simplified proxies for downstream applications
Evaluation scope: Considers only greedy decoding, not reasoning or tool usage
Format coverage: Does not cover AST-level or structured patch formats
Downstream connection: Lacks quantitative association with actual application performance

Future Directions

Extended formats: Explore tree-structured and AST-level edit representations
Downstream connection: Establish relationships between benchmark performance and actual application effectiveness
Reasoning capability: Evaluate multi-step reasoning and tool usage scenarios
Error recovery: Study handling of partial or corrupted diffs

In-Depth Evaluation

Strengths

Clear problem definition: Accurately identifies diff understanding as a core yet overlooked capability
Rigorous experimental design: Scientifically sound controlled-variable format comparison methodology
High data quality: Based on real commits with stratified sampling ensuring representativeness
Valuable findings: Format selection guidance has direct practical value
Strong reproducibility: Detailed experimental setup and open dataset

Weaknesses

Limited theoretical depth: Insufficient theoretical analysis of format difference mechanisms
Single evaluation dimension: Primarily focuses on correctness, lacking efficiency and readability considerations
Incomplete model coverage: Open-source models primarily concentrated in Qwen series
Limited application scenarios: Does not consider interactive editing and incremental update scenarios

Impact

Academic value: Fills important gap in code editing evaluation, likely to inspire follow-up research
Engineering value: Provides data support for industry format selection
Community contribution: Open benchmark and dataset will benefit the entire research community
Standardization potential: May become standard benchmark for code editing capability evaluation

Applicable Scenarios

Model development: Capability evaluation and improvement of code editing models
Format design: Effectiveness verification of new diff formats
Tool selection: Format strategy selection for code editing tools
Research foundation: Baseline capability testing for complex code editing tasks

References

The paper cites 31 related references, primarily including:

Code generation benchmarks: HumanEval, MBPP, BigCodeBench, etc.
Editing evaluation: SWE-bench, CodeEditorBench, etc.
Model technical reports: GPT-4o, Claude, Qwen2.5-Coder, etc.
Tools and formats: Aider, GNU diffutils, etc.

Overall Assessment: This is a high-quality systematic research paper that accurately identifies and addresses an important problem in code editing. While somewhat limited in theoretical depth, its practical value and methodological contributions are significant, with important implications for advancing code editing technology.