SELF-REDRAFT: Eliciting Intrinsic Exploration-Exploitation Balance in Test-Time Scaling for Code Generation

Chen, Zheng, Huang et al.
Test-time scaling without interpreter feedback is essential for real-world code generation scenarios where test cases are not readily available. While existing paradigms often rely on either greedy exploitation (i.e., iterative refinement) or stochastic exploration (i.e., relying on sample-based voting or reranking mechanisms), the balance between these two dimensions remains underexplored. To investigate the LLM's intrinsic ability to balance exploitation and exploration, we introduce SELF-REDRAFT, a framework built upon Self-Refine that encourages the model to propose new drafts for solutions that are fundamentally flawed. Our results show that SELF-REDRAFT consistently achieves better performance than Self-Refine when converged under the same maximum number of iterations. Still, we observe that significant room for improvement remains, largely due to two core aspects of current self-redraft capabilities: constrained capacity for generating instructive feedback and fragile discriminative judgment. We also find that balancing strategies vary notably across different LLMs, reflecting distinct, model-specific behaviors. Overall, our study establishes a baseline for intrinsic exploration-exploitation balancing in test-time scaling and identifies feedback and discrimination as key areas with potential for future advances.

Basic Information

  • Paper ID: 2511.02854
  • Title: SELF-REDRAFT: Eliciting Intrinsic Exploration-Exploitation Balance in Test-Time Scaling for Code Generation
  • Authors: Yixiang Chen*, Tianshi Zheng*, Shijue Huang, Zhitao He, Yi R. (May) Fung (*Equal Contribution)
  • Affiliation: Department of Computer Science and Engineering, HKUST
  • Categories: cs.SE (Software Engineering), cs.AI (Artificial Intelligence)
  • Submission Date: October 31, 2025
  • Paper Link: https://arxiv.org/abs/2511.02854v1

Abstract

This paper investigates the intrinsic capability of large language models (LLMs) to balance exploration and exploitation in code generation tasks during test-time scaling without execution feedback. Existing approaches either rely on greedy exploitation (iterative refinement) or random exploration (sampling-based voting or reranking), but the balance between them remains insufficiently studied. The authors propose the SELF-REDRAFT framework, which augments Self-Refine with a mechanism to redraft fundamentally flawed solutions. Experiments demonstrate that SELF-REDRAFT consistently outperforms Self-Refine under the same iteration budget, yet significant room for improvement remains, primarily constrained by two core capabilities: insufficient ability to generate directive feedback and fragile code discrimination. The study also reveals substantial differences in balancing strategies across different LLMs, reflecting model-specific behavioral characteristics.

Research Background and Motivation

1. Problem Statement

This paper addresses code generation in the execution-free test-time scaling scenario. In practical applications, test cases are often unavailable, requiring LLMs to autonomously improve code quality without program execution feedback.

2. Problem Significance

  • Practical Necessity: Test cases are frequently missing in real-world scenarios, and execution environments may be unavailable
  • Computational Efficiency: Test-time scaling is an effective means to enhance LLM performance, but requires maximizing performance within limited computational budgets
  • Theoretical Value: The exploration-exploitation tradeoff is a fundamental problem in reinforcement learning and search algorithms, yet its application in code generation remains insufficiently studied

3. Limitations of Existing Methods

  • Execution-Dependent Methods: Require test cases and execution environments, limiting applicability in real scenarios
  • Pure Exploitation Methods (e.g., Self-Refine): Only perform iterative optimization, easily trapped in local optima
  • Pure Exploration Methods (e.g., independent sampling with voting or reranking, typically scored by pass@k): Obtain diversity through multiple samples but lack targeted improvement
  • Missing Balance: Existing execution-free methods primarily rely on exploitation, with exploration dimensions neglected

4. Research Motivation

The authors aim to investigate LLMs' intrinsic ability to balance exploration and exploitation without execution feedback, identify current model bottlenecks, and provide directions for future improvements.

Core Contributions

  1. Proposes SELF-REDRAFT Framework: Introduces explicit exploration choices based on Self-Refine, allowing models to redraft fundamentally flawed solutions, achieving exploration-exploitation balance
  2. Establishes Benchmark Evaluation: Systematically evaluates 6 open-source and proprietary LLMs on LiveCodeBench, showing an average improvement of 0.615% over Self-Refine after 16 iterations
  3. Identifies Core Bottlenecks: Through in-depth analysis, reveals two critical limiting factors:
    • Insufficient Model Critique capability
    • Fragile Code Discrimination ability
  4. Reveals Model-Specific Behaviors: Discovers substantial differences in balancing strategies across different LLMs, indicating this capability is not universal but rather a model-specific emergent property
  5. Quantifies Improvement Space: By comparing with pass@8 upper bounds, quantifies the gap between current methods and pure exploration potential

Methodology Details

Task Definition

Input: Programming task description $x$
Output: Code solution $\hat{y}$ satisfying task requirements
Objective: Maximize code functional correctness through limited iterations (test-time computation) without test case execution feedback

Model Architecture

SELF-REDRAFT is an iterative framework comprising three main steps:

Step 0: Initialization

Given task $x$ and generation prompt $p_{gen}$, the model generates an initial solution: $y_0 \sim \pi(\cdot \mid p_{gen}, x)$

Step 1: Feedback Generation

The model evaluates the current solution $y_i$ using feedback prompt $p_{fb}$ to generate feedback $c_i$: $c_i \sim \pi(\cdot \mid p_{fb}, x, y_i)$

Feedback comprises two components:

  • Critique: Analyzes code issues and provides specific suggestions
  • Action Suggestion: Explicitly indicates next steps, including three options:
    • PASS: Code is correct, stop iteration
    • REFINE: Minor improvement, maintain original approach
    • REDRAFT: Fundamental error, requires new approach

Step 2: Regeneration

Based on feedback and historical trajectory, the model generates a new solution: $y_{i+1} \sim \pi(\cdot \mid p_{regen}, x, y_i, c_i, \ldots, y_0, c_0)$

According to feedback suggestions:

  • If REDRAFT: Generate entirely new solution (exploration)
  • If REFINE: Improve based on original solution (exploitation)

Iteration continues until stopping conditions are met (maximum iterations $T$ reached or model outputs PASS).
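
The three steps above can be sketched as a short control loop. This is a minimal illustration, not the paper's implementation: the callables `generate`, `give_feedback`, and `regenerate` are hypothetical stand-ins for the three prompted model calls, and the tuple layout of `trajectory` is an assumption.

```python
from typing import Callable, List, Tuple

# One trajectory entry: (solution y_i, critique text, suggested action).
Step = Tuple[str, str, str]

def self_redraft(
    task: str,
    generate: Callable[[str], str],                      # Step 0 (hypothetical wrapper)
    give_feedback: Callable[[str, str], Tuple[str, str]],  # Step 1 -> (critique, action)
    regenerate: Callable[[str, List[Step], str], str],   # Step 2 (hypothetical wrapper)
    max_iters: int = 16,
) -> Tuple[str, List[Step]]:
    """Run the draft -> feedback -> regenerate loop until PASS or the budget ends."""
    solution = generate(task)
    trajectory: List[Step] = []
    for _ in range(max_iters):
        critique, action = give_feedback(task, solution)
        trajectory.append((solution, critique, action))
        if action == "pass":
            break  # the model judges the current solution correct
        # "refine" keeps the approach (exploitation);
        # "redraft" asks for a new approach (exploration).
        solution = regenerate(task, trajectory, action)
    return solution, trajectory
```

Restricting `give_feedback` to return only "pass" or "refine" recovers a Self-Refine-style loop, which is why the two methods can be compared from the same initial solution.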

Technical Innovations

1. Explicit Exploration Mechanism

Core Difference from Self-Refine: Self-Refine supports only PASS and REFINE and is therefore purely exploitation-oriented. SELF-REDRAFT adds a REDRAFT option, which lets the model flag a fundamental error and start over with a new approach.

Design Rationale:

  • Code problems divide into surface errors (syntax, boundary conditions) and methodological errors (algorithm selection)
  • Surface errors suit progressive optimization (refine), methodological errors require rethinking (redraft)
  • By allowing models to autonomously judge error types, dynamic exploration-exploitation balance is achieved

2. Structured Feedback Design

Uses XML tags to enforce structured output:

<critique>
Detailed criticism and analysis
</critique>
<suggestion>
pass/refine/redraft
</suggestion>

This design facilitates:

  • Information extraction and algorithmic decision-making
  • Subsequent experimental analysis
  • Ensuring feedback actionability
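
To illustrate how the tagged output supports algorithmic decision-making, here is a minimal parsing sketch. The tag names come from the template above; the function names and the choice to return `None` for a missing or invalid suggestion are assumptions, not behavior described in the paper.

```python
import re
from typing import NamedTuple, Optional

VALID_ACTIONS = {"pass", "refine", "redraft"}

class Feedback(NamedTuple):
    critique: str
    action: Optional[str]  # None when no valid suggestion was found

def _tag(text: str, name: str) -> Optional[str]:
    """Return the stripped content of the first <name>...</name> pair, if any."""
    m = re.search(rf"<{name}>(.*?)</{name}>", text, re.DOTALL | re.IGNORECASE)
    return m.group(1).strip() if m else None

def parse_feedback(text: str) -> Feedback:
    """Split raw model output into critique text and a normalized action."""
    critique = _tag(text, "critique") or ""
    raw = _tag(text, "suggestion")
    action = raw.lower() if raw and raw.lower() in VALID_ACTIONS else None
    return Feedback(critique, action)
```

A caller has to decide what to do when `action` is `None` (for example, retry the feedback call or fall back to refine); the source does not say which policy the authors use.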

3. Trajectory Memory Mechanism

Regeneration includes the complete historical trajectory $(y_0, c_0, \ldots, y_i, c_i)$, enabling models to:

  • Avoid repeated errors
  • Learn improvement patterns
  • Retain valid information during exploration
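
One way to expose the trajectory to the model is to serialize every past (solution, feedback) pair into the regeneration prompt. The sketch below assumes an oldest-first ordering and an "Attempt / Feedback" layout; both are illustrative choices, and the paper's actual prompt templates are in its Appendix A.2.

```python
from typing import List, Tuple

def render_trajectory(trajectory: List[Tuple[str, str]]) -> str:
    """Serialize (y_0, c_0, ..., y_i, c_i) as plain text, oldest attempt first."""
    parts = []
    for i, (solution, feedback) in enumerate(trajectory):
        parts.append(f"Attempt {i}:\n{solution}\nFeedback {i}:\n{feedback}")
    return "\n\n".join(parts)
```

Because every earlier attempt is kept, the prompt grows roughly linearly with the iteration count, which matters for a 16-iteration budget.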

Experimental Setup

Dataset

LiveCodeBench (Jain et al., 2024):

  • Scale: 1,055 programming problems
  • Difficulty Levels: easy, medium, hard
  • Characteristics:
    • Comprehensive and uncontaminated evaluation benchmark
    • Derived from real programming competitions
    • Continuously updated to prevent training data leakage

Evaluation Metrics

  1. Pass@k: Functional correctness metric, $\text{pass@k} = \mathbb{E}_{\text{Problem}}\left[1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\right]$, where $n$ is the number of generated samples and $c$ is the number of correct samples. This paper uses $n=16$, $k=8$.
  2. Improvement Rate ($r_{imp}$): Proportion of initially incorrect solutions corrected
  3. Regression Rate ($r_{reg}$): Proportion of initially correct solutions broken
  4. Recall on Draft: Auxiliary evaluator's recall in correctly identifying "redraft" suggestions
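
The first three metrics can be computed as in the sketch below. The pass@k function follows the formula given above; the improvement and regression rates follow the verbal definitions, and the exact aggregation used in the paper (for example, how multiple samples per problem are pooled) is an assumption here.

```python
from math import comb
from typing import List, Sequence, Tuple

def pass_at_k(n: int, c: int, k: int) -> float:
    """Per-problem estimate: 1 - C(n-c, k) / C(n, k)."""
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(correct_counts: Sequence[int], n: int, k: int) -> float:
    """Average the per-problem estimate over all problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

def improvement_and_regression(
    initial: List[bool], final: List[bool]
) -> Tuple[float, float]:
    """r_imp: share of initially wrong solutions that end up correct.
    r_reg: share of initially correct solutions that end up wrong."""
    fixed = sum(1 for a, b in zip(initial, final) if not a and b)
    broken = sum(1 for a, b in zip(initial, final) if a and not b)
    n_wrong = sum(1 for a in initial if not a)
    n_right = len(initial) - n_wrong
    r_imp = fixed / n_wrong if n_wrong else 0.0
    r_reg = broken / n_right if n_right else 0.0
    return r_imp, r_reg
```

With the paper's setting of n=16 and k=8, a problem with a single correct sample (c=1) gets a pass@8 of 0.5, since C(15,8)/C(16,8) = 6435/12870.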

Comparison Methods

  • Self-Refine: Pure exploitation baseline supporting only iterative refinement
  • Pass@8: Pure exploration upper bound through independent sampling

Implementation Details

Model Configuration (6 LLMs):

  • GPT-4.1 mini, GPT-4.1 nano (OpenAI)
  • Kimi K2 (32B active parameters, 1T total parameters MoE)
  • Llama 4 Maverick (17B active parameters, 128-expert MoE)
  • LongCat-Flash-Chat (MoE, specialized for agent tasks)
  • Qwen3-Next-80B-A3B-Instruct

Generation Parameters (following LiveCodeBench defaults):

  • Temperature: 0.2
  • Top-p: 0.95
  • Frequency penalty: 0
  • Presence penalty: 0

Iteration Settings:

  • Maximum iterations: 16
  • Same initial solution set for fair comparison
  • Early stopping allowed (when model outputs PASS)

Experimental Results

Main Results

Overall Performance (Figure 2, complete table results in Appendix E):

  • SELF-REDRAFT achieves average improvement of 0.615% over Self-Refine after 16 iterations
  • Improvements consistently appear across all 6 tested models
  • Performance stabilizes at 16 iterations

Per-Model Performance (Figure 8):

  • Substantial absolute performance differences across models
  • Diverse iteration curve shapes reflecting different balancing strategies
  • Some models reach peak performance in early iterations with subsequent fluctuations

Unexploited Exploration Potential

Comparison with pass@8 Upper Bound (Figure 3):

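  • Pass@8 outperforms SELF-REDRAFT run for 16 iterations, even though the latter can produce up to 17 solutions per problem (the initial draft plus 16 regenerations)
  • Key Finding: Pure exploration (8 independent samples) is more effective than the current exploration-exploitation balance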
  • Gap Examples:
    • GPT-4.1 mini: SELF-REDRAFT 35.1% vs Pass@8 41.8%
    • Qwen3-Next: SELF-REDRAFT 48.2% vs Pass@8 55.3%

Interpretation: Many problems can be solved through diverse sampling alone, but SELF-REDRAFT fails to effectively leverage this advantage, indicating inefficient current exploration mechanisms.

Feedback Quality Analysis

Blind Evaluation Experimental Design (Section 3.3):

  • Sample (original solution, feedback, new solution) triplets from trajectories
  • Auxiliary evaluator judges whether methodological change occurred based only on solution pairs
  • Compare evaluator judgment with original feedback suggestions (refine vs redraft)
  • Balanced sampling: each group contains equal numbers of "draft" and "refine" labels
  • Maximum 1000 samples per generator model
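
The sampling and scoring steps above might look like the following sketch. The dictionary key `label`, the function names, and the reading of Recall on Draft as "share of redraft-labelled transitions that the blind evaluator also judges to be a methodological change" are assumptions based on this summary, not a confirmed specification.

```python
import random
from typing import Dict, List, Sequence

def balanced_sample(
    triplets: List[Dict], max_total: int = 1000, seed: int = 0
) -> List[Dict]:
    """Draw equally many 'redraft' and 'refine' triplets, capped at max_total."""
    rng = random.Random(seed)
    redraft = [t for t in triplets if t["label"] == "redraft"]
    refine = [t for t in triplets if t["label"] == "refine"]
    per_label = min(len(redraft), len(refine), max_total // 2)
    return rng.sample(redraft, per_label) + rng.sample(refine, per_label)

def recall_on_draft(labels: Sequence[str], judged_changed: Sequence[bool]) -> float:
    """Among transitions the feedback labelled 'redraft', the fraction the
    evaluator (seeing only the two solutions) judged as a change of method."""
    hits = [j for label, j in zip(labels, judged_changed) if label == "redraft"]
    return sum(hits) / len(hits) if hits else 0.0
```

Under this reading, a low recall means the generator often announced a redraft but then produced a solution that the evaluator saw as the same method.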

Recall on Draft Results (Figure 5):

  • Average recall rate: 30-55% range
  • Positive Correlation Finding (Figure 4): Recall on Draft positively correlates with SELF-REDRAFT improvement magnitude (correlation coefficient ~0.6-0.7)
  • Cross-Evaluator Consistency (Figure 7): High ranking consistency across different auxiliary models (Spearman ρ > 0.8)
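
For readers unfamiliar with the ranking-consistency measure, Spearman's ρ is the Pearson correlation of rank-transformed scores. The sketch below is a dependency-free illustration; the two score lists in the usage note are made-up values, not results from the paper.

```python
from typing import List, Sequence

def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out

def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation computed on ranks."""
    return pearson(ranks(x), ranks(y))
```

For example, if two hypothetical evaluators score four generators as `[0.30, 0.50, 0.40, 0.60]` and `[0.31, 0.52, 0.45, 0.50]`, only the top two ranks swap and `spearman` returns 0.8.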

Core Conclusion: Most models cannot provide actionable feedback for methodological correction, limiting effective exploration.

Code Discrimination Ability Analysis

Improvement Rate vs. Regression Rate Comparison (Table 1):

Model               | Self-Refine $r_{imp}$ | SELF-REDRAFT $r_{imp}$ | Self-Refine $r_{reg}$ | SELF-REDRAFT $r_{reg}$
--------------------|-----------------------|------------------------|-----------------------|-----------------------
GPT-4.1 mini        | 3.29%                 | 5.18% (+1.89)          | 1.11%                 | 1.27% (+0.16)
GPT-4.1 nano        | 19.52%                | 23.02% (+3.50)         | 1.70%                 | 2.33% (+0.63)
Kimi K2             | 9.89%                 | 12.99% (+3.10)         | 1.57%                 | 2.57% (+1.00)
Llama-4-Maverick    | 4.15%                 | 6.74% (+2.59)          | 1.68%                 | 3.78% (+2.10)
LongCat-Flash-Chat  | 18.68%                | 20.33% (+1.65)         | 2.69%                 | 3.01% (+0.32)
Qwen3-Next          | 26.53%                | 29.34% (+2.81)         | 0.30%                 | 0.60% (+0.30)

Key Findings:

  1. SELF-REDRAFT achieves higher improvement rates (corrects more errors)
  2. But regression rates also rise for every model (breaks more correct solutions)
  3. The rise is largest for Llama-4-Maverick, whose regression rate goes from 1.68% to 3.78% (+2.10 percentage points)

Interpretation: Redrafting is a high-risk operation. Due to limited discrimination ability, models frequently misclassify correct solutions as errors and "break" them, offsetting exploration benefits.
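
The offsetting effect can be made concrete with a little arithmetic. Assuming the two rates are measured per problem against the initial solution, the net accuracy change is (1 - a0) * r_imp - a0 * r_reg, where a0 is the initial accuracy. The sketch below plugs in the Table 1 rates for Llama-4-Maverick; the initial accuracy values are hypothetical, since this summary does not report them, and the paper's own aggregation may differ.

```python
def net_accuracy_change(a0: float, r_imp: float, r_reg: float) -> float:
    """Gains on initially wrong problems minus losses on initially right ones."""
    return (1.0 - a0) * r_imp - a0 * r_reg

def break_even_accuracy(d_imp: float, d_reg: float) -> float:
    """Initial accuracy above which extra regressions outweigh extra fixes.
    Solves (1 - a0) * d_imp = a0 * d_reg for a0."""
    return d_imp / (d_imp + d_reg)

# Table 1 rates for Llama-4-Maverick, written as fractions: (r_imp, r_reg).
self_refine = (0.0415, 0.0168)
self_redraft = (0.0674, 0.0378)

d_imp = self_redraft[0] - self_refine[0]  # extra fixes from allowing redraft
d_reg = self_redraft[1] - self_refine[1]  # extra breakages from allowing redraft
threshold = break_even_accuracy(d_imp, d_reg)

for a0 in (0.40, 0.70):  # hypothetical initial accuracies
    gain = net_accuracy_change(a0, *self_redraft) - net_accuracy_change(a0, *self_refine)
    print(f"a0={a0:.2f}: redraft minus refine = {gain:+.4f}")
```

Under these assumptions the threshold is about 0.55: the stronger the model's first draft, the more a given regression rate costs, which is one way to read the "high-risk" characterization above.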

Cross-Model Behavioral Differences

Balancing Strategy Differences (Figure 6):

  • Butterfly plot displays "refine" vs "redraft" suggestion counts across 16 iterations per model
  • Large Differences Across Models:
    • Some models favor "refine" (exploitation-oriented)
    • Some models favor "redraft" (exploration-oriented)
    • No unified pattern
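
The data behind such a plot is a per-iteration tally of suggestions. A minimal sketch, assuming each trajectory is stored as the list of actions the model emitted in order (a hypothetical storage format):

```python
from collections import Counter
from typing import List, Sequence

def suggestion_counts(
    trajectories: Sequence[Sequence[str]], max_iters: int = 16
) -> List[Counter]:
    """counts[i][action] = number of trajectories suggesting `action` at iteration i."""
    counts = [Counter() for _ in range(max_iters)]
    for actions in trajectories:
        for i, action in enumerate(actions[:max_iters]):
            counts[i][action] += 1
    return counts
```

Plotting `counts[i]["refine"]` to one side and `counts[i]["redraft"]` to the other for each iteration gives the butterfly shape; trajectories that stop early with "pass" simply contribute nothing to later iterations.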

Implications: Exploration-exploitation balance is not a universal capability but rather a model-specific emergent property, which may reflect:

  • Pretraining data differences
  • Model architecture influences
  • Different instruction-tuning strategies

Case Study

Complete Case in Appendix F:

  • Task: LeetCode-style array manipulation problem
  • Original Solution: Confused logic with multiple conceptual errors
  • Feedback: Detailed identification of 5 specific issues, recommending "redraft"
  • New Solution: Adopts completely different dynamic programming approach, correctly solves problem

Observations:

  • When feedback quality is high, redraft effectively escapes erroneous methods
  • New solution demonstrates problem re-understanding
  • However, such high-quality feedback is not the norm in experiments

1. Test-Time Scaling Methods

Execution-Dependent:

  • Self-Debug (Chen et al., 2023): Iterative debugging using execution feedback
  • Reflexion (Shinn et al., 2023): Reinforcement learning-based language agents
  • AIDE (Jiang et al., 2025): AI-driven exploration in code space
  • S* (Li et al., 2025): Test-time search methods

Execution-Independent:

  • Self-Refine (Madaan et al., 2023): Pure exploitation self-optimization
  • SETS (Chen et al., 2025): Self-verification and self-correction

2. Exploration-Exploitation Tradeoff

  • Tang et al. (2024): Modeling LLM code repair as exploration-exploitation tradeoff
  • This paper's distinction: Focuses on execution-free scenarios, studying intrinsic balancing capability

3. LLM Feedback Capability

  • Zheng et al. (2024): Reasoning mechanisms in multi-turn code generation
  • Xie et al. (2025): Teaching LLMs to critique through reinforcement learning
  • This paper's contribution: Quantifies feedback quality impact on exploration effectiveness

4. Code Generation Evaluation

  • LiveCodeBench (Jain et al., 2024): Uncontaminated comprehensive evaluation
  • Pass@k metric (Kulal et al., 2019; Chen et al., 2021)

Conclusions and Discussion

Main Conclusions

  1. SELF-REDRAFT Effective but Limited: Consistently outperforms Self-Refine under same iteration budget, but improvement magnitude is limited (average 0.615%)
  2. Two Major Bottlenecks:
    • Insufficient Feedback Generation: Models struggle to identify methodological errors, unable to provide effective redraft guidance
    • Fragile Code Discrimination: Misclassification causes harmful redrafts, increased regression rates offset benefits
  3. Model-Specific Nature: Balancing strategies vary dramatically across different LLMs, not a universal capability
  4. Large Unexploited Potential: Gap with pass@8 upper bound indicates substantial unexploited exploration space

Limitations

Explicitly Stated by Authors:

  1. Execution-Free Paradigm:
    • Research scope limited to execution-free scenarios
    • Not directly comparable with execution-dependent methods
    • Hybrid approaches are future direction
  2. Benchmark Generalization:
    • Evaluation only on LiveCodeBench
    • Generalization to other programming languages and domains requires verification
  3. Intrinsic Capability Dependence:
    • Performance constrained by pretrained model inherent capabilities
    • Training-driven improvements (e.g., fine-tuning critique ability) not explored
    • Non-intrinsic exploration strategies not investigated

Future Directions

Research Directions Proposed by Paper:

  1. Improve Feedback Generation:
    • Train specialized critique models
    • Design more effective feedback prompts
    • Introduce external knowledge for diagnosis assistance
  2. Enhance Discrimination Ability:
    • Improve code correctness judgment reliability
    • Reduce harmful redrafts
    • May require specialized verifiers
  3. Model-Adaptive Strategies:
    • Design customized balancing strategies for different models
    • Dynamically adjust exploration-exploitation ratios
    • Learn optimal stopping timing
  4. Hybrid Methods:
    • Combine execution feedback with intrinsic capabilities
    • Optimal strategies with limited test cases

In-Depth Evaluation

Strengths

1. Clear and Important Problem Definition

  • Focuses on practical scenarios (no test cases)
  • The exploration-exploitation tradeoff is a classical problem, and applying it to execution-free code generation is novel
  • Studying intrinsic capability rather than external tools provides high theoretical value

2. Simple and Effective Method Design

  • Minimal modifications to Self-Refine, clear comparison
  • Three-option design (pass/refine/redraft) intuitive and actionable
  • Structured feedback facilitates analysis

3. Rigorous Experimental Design

  • Fair Comparison: Uses identical initial solutions
  • Multi-Model Validation: 6 LLMs of different scales and architectures
  • Multi-Dimensional Analysis: Performance, feedback quality, discrimination ability, cross-model differences
  • Blind Evaluation Design: Avoids bias, uses auxiliary models for verification

4. In-Depth and Honest Analysis

  • Reports not only improvements but honestly points out limitations
  • Quantifies gap with upper bounds, clarifies improvement space
  • Identifies specific bottlenecks (feedback, discrimination) rather than vague conclusions
  • Reveals model-specificity, avoids overgeneralization

5. Strong Reproducibility

  • Detailed algorithm pseudocode (Algorithm 1)
  • Complete prompt templates (Appendix A.2)
  • Clear model configurations and hyperparameters (Appendix C)
  • Commits to open-sourcing code

Weaknesses

1. Limited Improvement Magnitude

  • Average 0.615% improvement is modest, statistical significance not clearly reported
  • Some models may fall within noise margins
  • Requires more experiments to verify stability

2. Limited Evaluation Scope

  • Only LiveCodeBench benchmark
  • Untested on other programming languages (beyond Python)
  • Other code quality dimensions (readability, efficiency) not evaluated

3. Lack of Theoretical Analysis

  • Why is 0.615% a reasonable expectation?
  • What is optimal exploration-exploitation ratio?
  • Missing formal theoretical framework

4. Stopping Condition Design Impact Insufficiently Discussed

  • Model autonomously deciding when to PASS may introduce bias
  • Different models' early stopping rates not reported
  • May affect fairness

5. Absence of Human Evaluation

  • All evaluations rely on automatic metrics and model judgment
  • Human perspective on feedback quality and code quality missing
  • Blind evaluation uses models rather than humans

6. Computational Cost Not Discussed

  • Actual cost of 16 iterations?
  • Cost comparison with pass@16?
  • Practical utility assessment insufficient

Impact

Contribution to Field

  1. Opens New Research Direction: Establishes benchmark for exploration-exploitation balance in execution-free scenarios
  2. Identifies Key Bottlenecks: Clarifies feedback and discrimination as core limitations
  3. Inspires Future Work: Provides clear improvement pathways

Practical Value

  • Moderate: Current improvements limited but direction clear
  • Suitable for scenarios where test cases unavailable
  • Can complement execution-dependent methods

Reproducibility

  • High: Detailed method description, prompt templates, configurations
  • Code to be open-sourced
  • Uses public benchmarks and API-accessible models

Applicable Scenarios

Suitable Scenarios:

  1. Code generation without test cases (early development stages)
  2. Execution environment unavailable or costly
  3. Exploratory programming requiring diverse solutions
  4. Preprocessing step for execution-dependent methods

Unsuitable Scenarios:

  1. Abundant test cases available (execution-dependent methods superior)
  2. Extremely high accuracy requirements for critical code
  3. Severely limited computational budgets (small improvement margins)
  4. Scenarios requiring guaranteed monotonic improvement (regression risk exists)

Key References

  1. Madaan et al. (2023) - Self-Refine: Foundation method of this paper
  2. Jain et al. (2024) - LiveCodeBench: Evaluation benchmark
  3. Tang et al. (2024) - Exploration-exploitation tradeoff in code repair
  4. Xie et al. (2025) - Improving critique ability through RL
  5. Chen et al. (2021) - Codex and pass@k metric
  6. Snell et al. (2024) - Theoretical foundations of test-time compute scaling

Summary

This is a solid empirical research paper addressing an important yet overlooked problem in code generation: exploration-exploitation balance under execution-free scenarios. SELF-REDRAFT is elegantly simple, introducing exploration mechanisms through minimal modifications. While absolute improvements are limited (0.615%), the paper's value lies in:

  1. Honest Scientific Attitude: Does not overstate effects, clearly identifies limitations and gaps
  2. In-Depth Mechanism Analysis: Identifies two bottlenecks, feedback and discrimination
  3. Clear Research Roadmap: Provides explicit directions for future work

The paper's primary contribution is not a powerful new method but a systematic account of where current LLMs fall short in autonomously balancing exploration and exploitation, which is equally important for advancing the field. For researchers, this provides clear improvement targets; for practitioners, it is a warning about the limitations of current methods.

Recommended future work focus:

  1. Train stronger critique and discrimination capabilities
  2. Explore integration of external knowledge and tools
  3. Research model-adaptive balancing strategies
  4. Validate across more benchmarks and scenarios