Reliable handling of code diffs is central to agents that edit and refactor repositories at scale. We introduce Diff-XYZ, a compact benchmark for code-diff understanding with three supervised tasks: apply (old code $+$ diff $\rightarrow$ new code), anti-apply (new code $-$ diff $\rightarrow$ old code), and diff generation (new code $-$ old code $\rightarrow$ diff). Instances in the benchmark are triples $\langle \textit{old code}, \textit{new code}, \textit{diff} \rangle$ drawn from real commits in CommitPackFT, paired with automatic metrics and a clear evaluation protocol. We use the benchmark to conduct a focused empirical study of the unified diff format and to run a cross-format comparison of different diff representations. Our findings reveal that different formats should be used depending on the use case and model size. For example, representing diffs in search-replace format works well for larger models in the diff generation scenario, yet is poorly suited to diff analysis and to smaller models. The Diff-XYZ benchmark is a reusable foundation for assessing and improving diff handling in LLMs that can aid future development of diff formats and of models that edit code. The dataset is published on HuggingFace Hub: https://huggingface.co/datasets/JetBrains-Research/diff-xyz.
Diff-XYZ: A Benchmark for Evaluating Diff Understanding
- Paper ID: 2510.12487
- Title: Diff-XYZ: A Benchmark for Evaluating Diff Understanding
- Authors: Evgeniy Glukhov, Michele Conti, Egor Bogomolov, Yaroslav Golubev, Alexander Bezzubov (JetBrains Research)
- Classification: cs.SE (Software Engineering), cs.LG (Machine Learning)
- Conference: 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Deep Learning for Code in the Agentic Era
- Paper Link: https://arxiv.org/abs/2510.12487
This paper proposes Diff-XYZ, a benchmark for evaluating how well large language models understand code diffs. It comprises three supervised tasks: apply (old code + diff → new code), anti-apply (new code - diff → old code), and diff generation (new code - old code → diff). The data derives from real commits in CommitPackFT and contains 1,000 triples ⟨old code, new code, diff⟩. The study finds that the best diff format depends on the use case and the model size, which gives developers of code editing models a concrete basis for choosing a format.
Modern code editing agents require reliable handling of code diffs in large-scale repository editing and refactoring, yet existing evaluation methods face the following issues:
- Lack of systematic format comparison: Multiple diff representation formats exist (unified diff, search-replace, etc.), but they have not been compared under controlled conditions
- Evaluation complexity: Existing end-to-end benchmarks (e.g., SWE-bench) conflate multiple factors (retrieval, tool usage, etc.), making it difficult to isolate the impact of diff formats
- Diverse failure modes: Patches may fail due to syntax errors, context mismatches, or logical errors, requiring more fine-grained analysis
- Practical value: Code diff processing is a core capability for automated code editing, bug fixing, CI build repair, and other tasks
- Theoretical significance: Understanding how LLMs process structured editing information is crucial for improving code generation models
- Engineering value: Provides data-driven guidance for selecting appropriate diff formats
- Proposes Diff-XYZ benchmark: A lightweight, reproducible evaluation framework with three complementary tasks
- Systematic format comparison: A controlled comparison of multiple diff representation formats on the same task instances
- Establishes performance baselines: Detailed performance baselines for proprietary and open-source models on diff understanding tasks
- Format selection guidance: Reveals relationships between model size, task type, and optimal format choices
- Open dataset: Publishes high-quality evaluation dataset on HuggingFace Hub
Based on the equation diff = new code - old code, three tasks are defined:
X. Apply Task
- Input: Old code + diff
- Output: New code
- Objective: Tests format adherence and character-level fidelity
Y. Anti-Apply Task
- Input: New code + diff
- Output: Old code
- Objective: Probes format reversibility and losslessness
Z. Diff Generation Task
- Input: Old code + new code
- Output: diff
- Objective: Tests reliable diff synthesis capability
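The three tasks can be read as three ways of hiding one element of the same triple. The sketch below is only an illustration: it builds the diff with Python's standard `difflib`, and the `make_udiff` and `make_tasks` helpers and the dictionary layout are assumptions of this example, not the benchmark's actual data schema.

```python
import difflib


def make_udiff(old: str, new: str, n: int = 3) -> str:
    """Build a unified diff between two texts with the standard library."""
    lines = difflib.unified_diff(
        old.splitlines(keepends=True),
        new.splitlines(keepends=True),
        fromfile="old",
        tofile="new",
        n=n,  # number of context lines around each change
    )
    return "".join(lines)


def make_tasks(old: str, new: str) -> dict:
    """Derive the X/Y/Z task instances from one (old, new) pair."""
    diff = make_udiff(old, new)
    return {
        # X: old code + diff -> new code
        "apply": {"input": (old, diff), "target": new},
        # Y: new code - diff -> old code
        "anti_apply": {"input": (new, diff), "target": old},
        # Z: new code - old code -> diff
        "diff_generation": {"input": (old, new), "target": diff},
    }


old = "def add(a, b):\n    return a - b\n"
new = "def add(a, b):\n    return a + b\n"
tasks = make_tasks(old, new)
print(tasks["diff_generation"]["target"])
```

Each task withholds exactly one element of the triple, so a model's output can be scored against a known reference without executing any code.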
Data Source: Real open-source commits from the CommitPackFT dataset
Filtering Strategy:
- Retains only single-file modification commits
- Excludes binary files, generated code, and vendor directories
- File line count constraint: 40-1,000 lines
- Excludes whitespace-only changes
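A filter along these lines might look like the following sketch. The `Commit` record, its field names, and the excluded directory list are hypothetical and are not taken from CommitPackFT. Applying the line-count bound to both file versions is also an assumption. Detection of binary and generated files is omitted.

```python
from dataclasses import dataclass


@dataclass
class Commit:
    # Hypothetical record; field names are assumptions of this sketch.
    paths: list[str]
    old_code: str
    new_code: str


# Assumed examples of vendored directories; the real list is not given.
EXCLUDED_DIRS = ("vendor/", "node_modules/", "third_party/")


def strip_ws(text: str) -> str:
    """Remove all whitespace so whitespace-only edits compare equal."""
    return "".join(text.split())


def keep(commit: Commit, min_lines: int = 40, max_lines: int = 1000) -> bool:
    """Return True if the commit passes the filters described above."""
    if len(commit.paths) != 1:  # single-file modifications only
        return False
    path = commit.paths[0]
    if any(part in path for part in EXCLUDED_DIRS):
        return False
    for code in (commit.old_code, commit.new_code):
        if not (min_lines <= len(code.splitlines()) <= max_lines):
            return False
    if strip_ws(commit.old_code) == strip_ws(commit.new_code):
        return False  # whitespace-only change
    return True
```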
Stratified Sampling:
- Language distribution: Python, JavaScript, Java, Kotlin, Rust with 200 samples each
- Edit complexity: Stratified by change block count and change size
  - Small edits: ≤7 line changes (40%)
  - Medium edits: 8-24 line changes (40%)
  - Large edits: >24 line changes (20%)
- Change types: 81.5% contain both additions and deletions, 16.3% additions only, 2.2% deletions only
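The size buckets above can be sketched as a small classifier. The thresholds come from the list, but how a "line change" is counted is not spelled out here. The example assumes it is the number of added lines plus the number of deleted lines, computed with `difflib.SequenceMatcher`.

```python
import difflib


def changed_lines(old: str, new: str) -> int:
    """Count added plus deleted lines (assumed definition of edit size)."""
    a, b = old.splitlines(), new.splitlines()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    total = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            total += (i2 - i1) + (j2 - j1)  # deleted lines + added lines
    return total


def size_bucket(n_changed: int) -> str:
    """Map a changed-line count to the small/medium/large strata."""
    if n_changed <= 7:
        return "small"
    if n_changed <= 24:
        return "medium"
    return "large"


print(size_bucket(changed_lines("a\nb\nc", "a\nB\nc")))
```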
Apply and Anti-Apply Tasks:
- Stripped Exact Match (EM): Exact match rate after removing blank lines
- Stripped Intersection over Union (IoU): Line-level intersection over union
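One plausible reading of these two metrics is sketched below. It assumes that "blank" means empty or whitespace-only, and that IoU is computed over sets of unique non-blank lines. The paper's exact normalization may differ.

```python
def stripped_lines(text: str) -> list[str]:
    """Split into lines and drop empty or whitespace-only ones."""
    return [line for line in text.splitlines() if line.strip()]


def stripped_em(pred: str, ref: str) -> float:
    """1.0 if the texts match line by line after dropping blank lines."""
    return float(stripped_lines(pred) == stripped_lines(ref))


def stripped_iou(pred: str, ref: str) -> float:
    """Intersection over union of the sets of non-blank lines."""
    p, r = set(stripped_lines(pred)), set(stripped_lines(ref))
    if not p and not r:
        return 1.0  # two empty files are treated as identical
    return len(p & r) / len(p | r)


ref = "a\n\nb\nc\n"
print(stripped_em("a\nb\n\n\nc", ref), stripped_iou("a\nb\nd\n", ref))
```

EM is all-or-nothing, while IoU gives partial credit when most lines are right, which is why the two are reported together.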
Diff Generation Task:
- Parsing Rate: Proportion of parseable diffs
- Applying Rate: Proportion of successfully applicable diffs
- EM/IoU after application: Exact match rate and IoU after applying diff
- F1+ / F1-: F1 scores for added and deleted lines
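As an illustration of F1+ and F1-, the sketch below extracts added and deleted lines from a single-file unified diff and scores them against a reference diff. Matching lines as multisets and treating everything before the first `@@` as header are assumptions of this example, not details confirmed by the paper.

```python
from collections import Counter


def diff_lines(diff: str, sign: str) -> Counter:
    """Collect added ('+') or deleted ('-') lines from the hunks of a diff."""
    out: Counter = Counter()
    in_hunk = False
    for line in diff.splitlines():
        if line.startswith("@@"):
            in_hunk = True  # file headers (---/+++) come before this point
        elif in_hunk and line.startswith(sign):
            out[line[1:]] += 1
    return out


def f1(pred: Counter, ref: Counter) -> float:
    """Multiset F1 between predicted and reference lines."""
    if not pred and not ref:
        return 1.0
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


ref_diff = (
    "--- old\n+++ new\n@@ -1,2 +1,2 @@\n"
    " def add(a, b):\n-    return a - b\n+    return a + b\n"
)
pred_diff = ref_diff + "+    # fixed\n"  # one spurious added line

f1_plus = f1(diff_lines(pred_diff, "+"), diff_lines(ref_diff, "+"))
f1_minus = f1(diff_lines(pred_diff, "-"), diff_lines(ref_diff, "-"))
print(f1_plus, f1_minus)
```

Unlike EM/IoU after application, these scores can still be computed when a diff fails to apply, as long as its added and deleted lines can be read.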
- Complementary task design: Three tasks comprehensively evaluate diff understanding from different angles
- Controlled experiments: Isolates the effect of the diff format by holding the task context fixed and varying only the representation
- Real data-driven approach: Based on real commits rather than synthetic data, ensuring ecological validity
- Multi-dimensional evaluation: Combines syntactic validity (parsing rate), application success rate, and content-level correctness (EM, IoU, F1)
- udiff: Standard unified diff format
- udiff-h: Unified diff with relaxed hunk headers (no line numbers)
- udiff-l: Unified diff with explicit markers (ADD/DEL/CON)
- search-replace: Search-replace format
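To make the variants concrete, the sketch below derives udiff-h and udiff-l from a standard unified diff. The placeholder header `@@ ... @@` and the `TAG<space>content` layout are assumptions made for illustration; only the ADD/DEL/CON tag names come from the list above.

```python
import re

# Matches a standard hunk header such as "@@ -1,2 +1,2 @@".
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,\d+)? \+\d+(?:,\d+)? @@")

# Assumed mapping from single-character markers to explicit tags.
TAGS = {"+": "ADD", "-": "DEL", " ": "CON"}


def to_udiff_h(diff: str) -> str:
    """Replace numeric hunk headers with a line-number-free placeholder."""
    out = []
    for line in diff.splitlines():
        out.append("@@ ... @@" if HUNK_HEADER.match(line) else line)
    return "\n".join(out)


def to_udiff_l(diff: str) -> str:
    """Replace +/-/space line markers inside hunks with explicit tags."""
    out, in_hunk = [], False
    for line in diff.splitlines():
        if line.startswith("@@"):
            in_hunk = True
            out.append(line)
        elif in_hunk and line[:1] in TAGS:
            out.append(f"{TAGS[line[0]]} {line[1:]}")
        else:
            out.append(line)  # file headers and unmarked lines pass through
    return "\n".join(out)


sample = (
    "--- old\n+++ new\n@@ -1,2 +1,2 @@\n"
    " def add(a, b):\n-    return a - b\n+    return a + b\n"
)
print(to_udiff_h(sample))
print(to_udiff_l(sample))
```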
Proprietary Models:
- GPT-4o, GPT-4o-mini
- GPT-4.1, GPT-4.1-mini, GPT-4.1-nano
- Claude 4 Sonnet
- Gemini 2.5 Flash
Open-Source Models:
- Qwen2.5-Coder series (0.5B-32B)
- w/o format: Generic assistant prompt
- w/ format: System prompt with format description
Proprietary Model Performance:
- Claude 4 Sonnet achieves best performance on Apply task (EM: 0.95-0.96)
- GPT-4.1 demonstrates strong performance across all tasks but is sensitive to prompts
- Smaller proprietary models (e.g., GPT-4.1-nano) show significant degradation on complex tasks
Open-Source Model Scaling Trends:
- Performance clearly improves with model scale
- Qwen2.5-Coder-32B approaches GPT-4o level on Apply/Anti-Apply tasks
- Significant gap remains on Diff Generation
Key Discoveries:
- Task Dependency:
  - Apply/Anti-Apply: udiff format performs best
  - Diff Generation: search-replace superior for large models
- Model Scale Effects:
  - Large models: search-replace excels in generation tasks
  - Small models: udiff-l (explicit markers) performs best
- Format Characteristic Analysis:
  - search-replace advantages: Avoids global constraints such as consistent line numbers; each edit is an independent local replacement
  - udiff-h disadvantages: Removing the line-number scaffolding from hunk headers appears to cause structural confusion
  - udiff-l advantages: Explicit ADD/DEL/CON markers reduce marker conflicts (for example, with code lines that themselves begin with + or -)
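The "independent local edits" property of search-replace can be seen in a minimal applier. The Aider-style delimiters (`<<<<<<< SEARCH`, `=======`, `>>>>>>> REPLACE`) and the rule that each search text must match exactly once are assumptions of this sketch. The benchmark's exact block syntax may differ.

```python
def parse_blocks(patch: str) -> list[tuple[str, str]]:
    """Parse SEARCH/REPLACE blocks into (search, replace) pairs."""
    blocks, search, replace, state = [], [], [], None
    for line in patch.splitlines():
        if line == "<<<<<<< SEARCH":
            search, replace, state = [], [], "search"
        elif line == "=======" and state == "search":
            state = "replace"
        elif line == ">>>>>>> REPLACE" and state == "replace":
            blocks.append(("\n".join(search), "\n".join(replace)))
            state = None
        elif state == "search":
            search.append(line)
        elif state == "replace":
            replace.append(line)
    return blocks


def apply_blocks(code: str, patch: str) -> str:
    """Apply each block as an independent exact-text replacement."""
    for search, replace in parse_blocks(patch):
        if code.count(search) != 1:
            raise ValueError("search text must match exactly once")
        code = code.replace(search, replace, 1)
    return code


old = (
    "def add(a, b):\n    return a - b\n\n"
    "def sub(a, b):\n    return a + b\n"
)
patch = """<<<<<<< SEARCH
def add(a, b):
    return a - b
=======
def add(a, b):
    return a + b
>>>>>>> REPLACE
<<<<<<< SEARCH
def sub(a, b):
    return a + b
=======
def sub(a, b):
    return a - b
>>>>>>> REPLACE
"""
print(apply_blocks(old, patch))
```

No block carries line numbers or hunk sizes, so an error in one block does not invalidate the others. A unified diff, by contrast, needs hunk headers that stay consistent with the whole file.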
Prompt Impact:
- Diff Generation task highly sensitive to format descriptions
- Without a format description, GPT-4.1 tends to emit OpenAI's V4A patch format instead of the requested diff format
- Apply task relatively robust to prompt variations
Language Differences:
- Relatively consistent performance across five programming languages
- Slightly better performance on Python and JavaScript
- HumanEval/MBPP: Function-level code generation
- BigCodeBench: Complex library call tasks
- Limitations: Primarily focus on generation from scratch, not editing representations
- SWE-bench: Real GitHub issue resolution
- CodeEditorBench: Instruction-following editing
- Limitations: End-to-end evaluation, difficult to isolate format impact
- Complementarity: Diff-XYZ studies edit representations in isolation, complementing the end-to-end benchmarks above
- Lightweight: No repository setup or execution environment required
- Controllability: Fixes task context, varies representation formats
- Format selection is critical: Different formats show significant performance variations across tasks and model sizes
- Task specificity: Generation and application tasks require different optimal formats
- Scale dependency: Optimal strategies differ for small and large models
- Reality gap: Open-source models still have substantial room for improvement in diff generation
- Task simplification: Benchmark tasks are simplified proxies for downstream applications
- Evaluation scope: Considers only greedy decoding, not reasoning or tool usage
- Format coverage: Does not cover AST-level or structured patch formats
- Downstream connection: Lacks quantitative association with actual application performance
- Extended formats: Explore tree-structured and AST-level edit representations
- Downstream connection: Establish relationships between benchmark performance and actual application effectiveness
- Reasoning capability: Evaluate multi-step reasoning and tool usage scenarios
- Error recovery: Study handling of partial or corrupted diffs
- Clear problem definition: Accurately identifies diff understanding as a core yet overlooked capability
- Rigorous experimental design: Scientifically sound controlled-variable format comparison methodology
- High data quality: Based on real commits with stratified sampling ensuring representativeness
- Valuable findings: Format selection guidance has direct practical value
- Strong reproducibility: Detailed experimental setup and open dataset
- Limited theoretical depth: Insufficient theoretical analysis of format difference mechanisms
- Single evaluation dimension: Primarily focuses on correctness, lacking efficiency and readability considerations
- Incomplete model coverage: Open-source models primarily concentrated in Qwen series
- Limited application scenarios: Does not consider interactive editing and incremental update scenarios
- Academic value: Fills important gap in code editing evaluation, likely to inspire follow-up research
- Engineering value: Provides data support for industry format selection
- Community contribution: Open benchmark and dataset will benefit the entire research community
- Standardization potential: May become standard benchmark for code editing capability evaluation
- Model development: Capability evaluation and improvement of code editing models
- Format design: Effectiveness verification of new diff formats
- Tool selection: Format strategy selection for code editing tools
- Research foundation: Baseline capability testing for complex code editing tasks
The paper cites 31 related references, primarily including:
- Code generation benchmarks: HumanEval, MBPP, BigCodeBench, etc.
- Editing evaluation: SWE-bench, CodeEditorBench, etc.
- Model technical reports: GPT-4o, Claude, Qwen2.5-Coder, etc.
- Tools and formats: Aider, GNU diffutils, etc.
Overall Assessment: This is a high-quality systematic research paper that accurately identifies and addresses an important problem in code editing. While somewhat limited in theoretical depth, its practical value and methodological contributions are significant, with important implications for advancing code editing technology.