2025-11-30T18:52:18.815530

SELF-REDRAFT: Eliciting Intrinsic Exploration-Exploitation Balance in Test-Time Scaling for Code Generation

Chen, Zheng, Huang et al.

Test-time scaling without interpreter feedback is essential for real-world code generation scenarios where test cases are not readily available. While existing paradigms often rely on either greedy exploitation (i.e., iterative refinement) or stochastic exploration (i.e., relying on sample-based voting or reranking mechanisms), the balance between these two dimensions remains underexplored. To investigate the LLM's intrinsic ability to balance exploitation and exploration, we introduce SELF-REDRAFT, a framework built upon Self-Refine that encourages the model to propose new drafts for solutions that are fundamentally flawed. Our results show that SELF-REDRAFT consistently achieves better performance than Self-Refine when converged under the same maximum number of iterations. Still, we observe that significant room for improvement remains, largely due to two core aspects of current self-redraft capabilities: constrained capacity for generating instructive feedback and fragile discriminative judgment. We also find that balancing strategies vary notably across different LLMs, reflecting distinct, model-specific behaviors. Overall, our study establishes a baseline for intrinsic exploration-exploitation balancing in test-time scaling and identifies feedback and discrimination as key areas with potential for future advances.


Basic Information

Paper ID: 2511.02854
Title: SELF-REDRAFT: Eliciting Intrinsic Exploration-Exploitation Balance in Test-Time Scaling for Code Generation
Authors: Yixiang Chen*, Tianshi Zheng*, Shijue Huang, Zhitao He, Yi R. (May) Fung (*Equal Contribution)
Affiliation: Department of Computer Science and Engineering, HKUST
Categories: cs.SE (Software Engineering), cs.AI (Artificial Intelligence)
Submission Date: October 31, 2025
Paper Link: https://arxiv.org/abs/2511.02854v1

Abstract

This paper investigates the intrinsic capability of large language models (LLMs) to balance exploration and exploitation in code generation during test-time scaling without execution feedback. Existing approaches rely either on greedy exploitation (iterative refinement) or on stochastic exploration (sampling-based voting or reranking), but the balance between the two remains insufficiently studied. The authors propose the SELF-REDRAFT framework, which augments Self-Refine with a mechanism to redraft fundamentally flawed solutions. Experiments show that SELF-REDRAFT consistently outperforms Self-Refine under the same iteration budget, yet significant room for improvement remains, primarily because of two limitations: a constrained capacity for generating instructive feedback and fragile discriminative judgment about code correctness. The study also reveals substantial differences in balancing strategies across LLMs, reflecting model-specific behavior.

Research Background and Motivation

1. Problem Statement

This paper addresses code generation in the execution-free test-time scaling scenario. In practical applications, test cases are often unavailable, requiring LLMs to autonomously improve code quality without program execution feedback.

2. Problem Significance

Practical Necessity: Test cases are frequently missing in real-world scenarios, and execution environments may be unavailable
Computational Efficiency: Test-time scaling is an effective means to enhance LLM performance, but requires maximizing performance within limited computational budgets
Theoretical Value: The exploration-exploitation tradeoff is a fundamental problem in reinforcement learning and search algorithms, yet its application in code generation remains insufficiently studied

3. Limitations of Existing Methods

Execution-Dependent Methods: Require test cases and execution environments, limiting applicability in real scenarios
Pure Exploitation Methods (e.g., Self-Refine): Only perform iterative optimization, easily trapped in local optima
Pure Exploration Methods (e.g., independent sampling, as measured by pass@k): Obtain diversity through multiple samples but lack targeted improvement
Missing Balance: Existing execution-free methods primarily rely on exploitation, with exploration dimensions neglected

4. Research Motivation

The authors aim to investigate LLMs' intrinsic ability to balance exploration and exploitation without execution feedback, identify current model bottlenecks, and provide directions for future improvements.

Core Contributions

Proposes SELF-REDRAFT Framework: Adds an explicit exploration choice to Self-Refine, allowing models to redraft fundamentally flawed solutions and thereby choose between exploration and exploitation at each step
Establishes Benchmark Evaluation: Systematically evaluates 6 open-source and proprietary LLMs on LiveCodeBench, demonstrating average improvement of 0.615% after 16 iterations
Identifies Core Bottlenecks: Through in-depth analysis, reveals two critical limiting factors:
- Insufficient Model Critique capability
- Fragile Code Discrimination ability
Reveals Model-Specific Behaviors: Discovers substantial differences in balancing strategies across different LLMs, indicating this capability is not universal but rather a model-specific emergent property
Quantifies Improvement Space: By comparing with pass@8 upper bounds, quantifies the gap between current methods and pure exploration potential

Methodology Details

Task Definition

Input: Programming task description $x$
Output: Code solution $\hat{y}$ satisfying task requirements
Objective: Maximize code functional correctness through limited iterations (test-time computation) without test case execution feedback

Model Architecture

SELF-REDRAFT is an iterative framework comprising three main steps:

Step 0: Initialization

Given task $x$ and generation prompt $p_{gen}$, the model generates an initial solution: $y_0 \sim \pi(\cdot | p_{gen}, x)$

Step 1: Feedback Generation

The model evaluates the current solution $y_i$ using feedback prompt $p_{fb}$ to generate feedback $c_i$: $c_i \sim \pi(\cdot | p_{fb}, x, y_i)$

Feedback comprises two components:

Critique: Analyzes code issues and provides specific suggestions
Action Suggestion: Explicitly indicates next steps, including three options:
- PASS: Code is correct, stop iteration
- REFINE: Minor improvement, maintain original approach
- REDRAFT: Fundamental error, requires new approach

Step 2: Regeneration

Based on feedback and historical trajectory, the model generates a new solution: $y_{i+1} \sim \pi(\cdot | p_{regen}, x, y_i, c_i, \ldots, y_0, c_0)$

According to feedback suggestions:

If REDRAFT: Generate entirely new solution (exploration)
If REFINE: Improve based on original solution (exploitation)

Iteration continues until stopping conditions are met (maximum iterations $T$ reached or model outputs PASS).
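The three steps above can be sketched as a short loop. This is a minimal illustration, not the authors' implementation: the `Model` callable, the bracketed prompt prefixes, the instruction strings, and the fallback to "refine" when no suggestion can be parsed are all assumptions made for the example.

```python
import re
from typing import Callable, List, Tuple

# Hypothetical interface: a model is any callable mapping a prompt to a completion.
Model = Callable[[str], str]


def suggested_action(feedback: str) -> str:
    """Extract pass/refine/redraft from the <suggestion> tag."""
    match = re.search(
        r"<suggestion>\s*(pass|refine|redraft)\s*</suggestion>",
        feedback,
        re.IGNORECASE,
    )
    # Assumption: treat unparseable feedback as a request to refine.
    return match.group(1).lower() if match else "refine"


def self_redraft(
    model: Model, task: str, max_iters: int = 16
) -> Tuple[str, List[Tuple[str, str]]]:
    """Run the feedback/regeneration loop; return final solution and trajectory."""
    # Step 0: initial draft y_0.
    solution = model(f"[GEN]\nTask:\n{task}")
    history: List[Tuple[str, str]] = []  # (y_i, c_i) pairs, oldest first

    for _ in range(max_iters):
        # Step 1: feedback c_i on the current solution y_i.
        feedback = model(f"[FEEDBACK]\nTask:\n{task}\nSolution:\n{solution}")
        history.append((solution, feedback))
        action = suggested_action(feedback)
        if action == "pass":
            break  # the model judges the solution correct

        # Step 2: regenerate, conditioned on the full trajectory.
        trajectory = "\n".join(
            f"Solution {i}:\n{y}\nFeedback {i}:\n{c}"
            for i, (y, c) in enumerate(history)
        )
        if action == "redraft":
            instruction = "Write a new solution using a different approach."
        else:
            instruction = "Revise the latest solution, keeping its approach."
        solution = model(f"[REGEN]\nTask:\n{task}\n{trajectory}\n{instruction}")

    return solution, history
```

Restricting `suggested_action` to pass and refine would reduce this loop to a Self-Refine-style baseline, which is the comparison the paper draws.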

Technical Innovations

1. Explicit Exploration Mechanism

Core Difference from Self-Refine: Self-Refine supports only PASS and REFINE and is therefore purely exploitation-oriented. SELF-REDRAFT adds a REDRAFT option, allowing models to identify fundamental errors and start over with a new approach.

Design Rationale:

Code problems divide into surface errors (syntax, boundary conditions) and methodological errors (algorithm selection)
Surface errors suit progressive optimization (refine), methodological errors require rethinking (redraft)
By allowing models to autonomously judge error types, dynamic exploration-exploitation balance is achieved

2. Structured Feedback Design

Uses XML tags to enforce structured output:

<critique>
Detailed criticism and analysis
</critique>
<suggestion>
pass/refine/redraft
</suggestion>

This design facilitates:

Information extraction and algorithmic decision-making
Subsequent experimental analysis
Ensuring feedback actionability
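A sketch of how the tagged feedback could be extracted is shown here. The `Feedback` tuple and `parse_feedback` function are hypothetical names, and returning `None` for a missing or invalid suggestion is an assumption; the paper's own parsing rules are not described in this summary.

```python
import re
from typing import NamedTuple, Optional


class Feedback(NamedTuple):
    critique: str
    action: Optional[str]  # "pass", "refine", "redraft", or None if unparseable


_VALID_ACTIONS = {"pass", "refine", "redraft"}


def parse_feedback(text: str) -> Feedback:
    """Pull the critique and the suggested action out of tagged model output."""
    flags = re.DOTALL | re.IGNORECASE
    critique = re.search(r"<critique>(.*?)</critique>", text, flags)
    suggestion = re.search(r"<suggestion>(.*?)</suggestion>", text, flags)

    action = suggestion.group(1).strip().lower() if suggestion else None
    if action not in _VALID_ACTIONS:
        action = None  # assumption: reject anything outside the three options

    return Feedback(critique.group(1).strip() if critique else "", action)
```

For example, `parse_feedback("<critique>Off-by-one</critique><suggestion>Redraft</suggestion>")` yields a critique of `"Off-by-one"` and an action of `"redraft"`.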

3. Trajectory Memory Mechanism

Regeneration includes the complete historical trajectory $(y_0, c_0, \ldots, y_i, c_i)$, enabling models to:

Avoid repeated errors
Learn improvement patterns
Retain valid information during exploration

Experimental Setup

Dataset

LiveCodeBench (Jain et al., 2024):

Scale: 1,055 programming problems
Difficulty Levels: easy, medium, hard
Characteristics:
- Comprehensive and uncontaminated evaluation benchmark
- Derived from real programming competitions
- Continuously updated to prevent training data leakage

Evaluation Metrics

Pass@k: Functional correctness metric $\text{pass@k} = \mathbb{E}_{\text{Problem}}\left[1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\right]$, where $n$ is the number of generated samples and $c$ is the number of correct samples. This paper uses $n=16, k=8$.
Improvement Rate ($r_{imp}$): Proportion of initially incorrect solutions corrected
Regression Rate ($r_{reg}$): Proportion of initially correct solutions broken
Recall on Draft: Auxiliary evaluator's recall in correctly identifying "redraft" suggestions
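The pass@k formula above can be computed directly with exact binomial coefficients. The function names below are illustrative, not taken from the paper or from LiveCodeBench.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Per-problem estimate: 1 - C(n - c, k) / C(n, k)."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(counts: Iterable[Tuple[int, int]], k: int) -> float:
    """Average the per-problem estimate over (n, c) pairs, one per problem."""
    values = [pass_at_k(n, c, k) for n, c in counts]
    if not values:
        raise ValueError("no problems given")
    return sum(values) / len(values)
```

As a worked case with the paper's setting $n=16, k=8$: a problem with a single correct sample ($c=1$) gives $1 - \binom{15}{8}/\binom{16}{8} = 1 - 6435/12870 = 0.5$.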

Comparison Methods

Self-Refine: Pure exploitation baseline supporting only iterative refinement
Pass@8: Pure exploration upper bound through independent sampling

Implementation Details

Model Configuration (6 LLMs):

GPT-4.1 mini, GPT-4.1 nano (OpenAI)
Kimi K2 (32B active parameters, 1T total parameters MoE)
Llama 4 Maverick (17B active parameters, 128-expert MoE)
LongCat-Flash-Chat (MoE, specialized for agent tasks)
Qwen3-Next-80B-A3B-Instruct

Generation Parameters (following LiveCodeBench defaults):

Temperature: 0.2
Top-p: 0.95
Frequency penalty: 0
Presence penalty: 0

Iteration Settings:

Maximum iterations: 16
Same initial solution set for fair comparison
Early stopping allowed (when model outputs PASS)

Experimental Results

Main Results

Overall Performance (Figure 2, complete table results in Appendix E):

SELF-REDRAFT achieves average improvement of 0.615% over Self-Refine after 16 iterations
Improvements consistently appear across all 6 tested models
Performance stabilizes at 16 iterations

Per-Model Performance (Figure 8):

Substantial absolute performance differences across models
Diverse iteration curve shapes reflecting different balancing strategies
Some models reach peak performance in early iterations with subsequent fluctuations

Unexploited Exploration Potential

Comparison with pass@8 Upper Bound (Figure 3):

Pass@8 significantly outperforms SELF-REDRAFT×16 (17 solutions)
Key Finding: Pure exploration (8 independent samples) more effective than current exploration-exploitation balance
Gap Examples:
- GPT-4.1 mini: SELF-REDRAFT 35.1% vs Pass@8 41.8%
- Qwen3-Next: SELF-REDRAFT 48.2% vs Pass@8 55.3%

Interpretation: Many problems can be solved through diverse sampling alone, but SELF-REDRAFT fails to effectively leverage this advantage, indicating inefficient current exploration mechanisms.

Feedback Quality Analysis

Blind Evaluation Experimental Design (Section 3.3):

Sample (original solution, feedback, new solution) triplets from trajectories
Auxiliary evaluator judges whether methodological change occurred based only on solution pairs
Compare evaluator judgment with original feedback suggestions (refine vs redraft)
Balanced sampling: each group contains equal numbers of "redraft" and "refine" labels
Maximum 1000 samples per generator model
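The sampling and scoring steps above might look like the following sketch. It rests on several assumptions: each sample is reduced to a pair of (label from the original feedback, evaluator verdict), the 1000-sample cap is treated as a total across both labels, and Recall on Draft is read as the fraction of "redraft"-labelled samples for which the evaluator also detects a methodological change.

```python
import random
from typing import List, Sequence, Tuple

# (label from the original feedback, evaluator judged a methodological change)
Sample = Tuple[str, bool]


def balanced_subset(
    samples: Sequence[Sample], max_total: int = 1000, seed: int = 0
) -> List[Sample]:
    """Draw equal numbers of 'redraft' and 'refine' samples, capped at max_total."""
    rng = random.Random(seed)
    redraft = [s for s in samples if s[0] == "redraft"]
    refine = [s for s in samples if s[0] == "refine"]
    per_label = min(len(redraft), len(refine), max_total // 2)
    return rng.sample(redraft, per_label) + rng.sample(refine, per_label)


def recall_on_draft(samples: Sequence[Sample]) -> float:
    """Share of 'redraft'-labelled samples the evaluator also flags as redrafts."""
    verdicts = [judged for label, judged in samples if label == "redraft"]
    if not verdicts:
        raise ValueError("no redraft-labelled samples")
    return sum(verdicts) / len(verdicts)
```

A low value under this reading means the feedback asked for a new approach but the regenerated solution did not visibly change method.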

Recall on Draft Results (Figure 5):

Average recall rate: 30-55% range
Positive Correlation Finding (Figure 4): Recall on Draft positively correlates with SELF-REDRAFT improvement magnitude (correlation coefficient ~0.6-0.7)
Cross-Evaluator Consistency (Figure 7): High ranking consistency across different auxiliary models (Spearman ρ > 0.8)

Core Conclusion: Most models cannot provide actionable feedback for methodological correction, limiting effective exploration.

Code Discrimination Ability Analysis

Improvement Rate vs. Regression Rate Comparison (Table 1):

Model	Self-Refine $r_{imp}$	SELF-REDRAFT $r_{imp}$	Self-Refine $r_{reg}$	SELF-REDRAFT $r_{reg}$
GPT-4.1 mini	3.29%	5.18% (+1.89)	1.11%	1.27% (+0.16)
GPT-4.1 nano	19.52%	23.02% (+3.50)	1.70%	2.33% (+0.63)
Kimi K2	9.89%	12.99% (+3.10)	1.57%	2.57% (+1.00)
Llama-4-Maverick	4.15%	6.74% (+2.59)	1.68%	3.78% (+2.10)
LongCat-Flash-Chat	18.68%	20.33% (+1.65)	2.69%	3.01% (+0.32)
Qwen3-Next	26.53%	29.34% (+2.81)	0.30%	0.60% (+0.30)

Key Findings:

SELF-REDRAFT achieves higher improvement rates (corrects more errors)
But regression rates also rise for every model, by 0.16 to 2.10 percentage points (more correct solutions are broken)
The increase is largest for Llama-4-Maverick (+2.10 percentage points), where it nearly matches the gain in improvement rate (+2.59)

Interpretation: Redrafting is a high-risk operation. Due to limited discrimination ability, models frequently misclassify correct solutions as errors and "break" them, offsetting exploration benefits.
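To make the offsetting effect concrete, the sketch below computes both rates from per-problem correctness flags and combines them into a net accuracy change. It assumes the denominators stated in the metric definitions (initially incorrect solutions for $r_{imp}$, initially correct solutions for $r_{reg}$); the function names are illustrative.

```python
from typing import Sequence, Tuple


def improvement_and_regression(
    initial: Sequence[bool], final: Sequence[bool]
) -> Tuple[float, float]:
    """Return (r_imp, r_reg) from correctness before and after iteration."""
    if len(initial) != len(final):
        raise ValueError("initial and final must have the same length")
    after_wrong = [f for i, f in zip(initial, final) if not i]
    after_right = [f for i, f in zip(initial, final) if i]
    # r_imp: initially incorrect solutions that end up correct.
    r_imp = sum(after_wrong) / len(after_wrong) if after_wrong else 0.0
    # r_reg: initially correct solutions that end up incorrect.
    r_reg = sum(not f for f in after_right) / len(after_right) if after_right else 0.0
    return r_imp, r_reg


def net_accuracy_change(p0: float, r_imp: float, r_reg: float) -> float:
    """Net change in accuracy given initial accuracy p0 and the two rates."""
    return (1.0 - p0) * r_imp - p0 * r_reg
```

Because the regression rate is weighted by the initial accuracy, a small increase in $r_{reg}$ can cancel a larger increase in $r_{imp}$ when most initial solutions are already correct.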

Cross-Model Behavioral Differences

Balancing Strategy Differences (Figure 6):

Butterfly plot displays "refine" vs "redraft" suggestion counts across 16 iterations per model
Large Differences:
- Some models favor "refine" (exploitation-oriented)
- Some models favor "redraft" (exploration-oriented)
- No unified pattern
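The per-iteration counts behind such a plot can be tallied as follows. This is only a sketch of the bookkeeping: the input format (one list of suggested actions per problem, in iteration order) is an assumption, and no plotting is attempted.

```python
from typing import Dict, List


def tally_actions(
    trajectories: List[List[str]], max_iters: int = 16
) -> Dict[str, List[int]]:
    """Count 'refine' and 'redraft' suggestions at each iteration index."""
    counts = {"refine": [0] * max_iters, "redraft": [0] * max_iters}
    for actions in trajectories:
        for step, action in enumerate(actions[:max_iters]):
            if action in counts:  # 'pass' ends a trajectory and is not counted
                counts[action][step] += 1
    return counts
```

Comparing the two lists per model gives a direct view of whether that model leans toward exploitation or exploration at each step.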

Implications: Exploration-exploitation balance is not a universal capability but rather a model-specific emergent property, reflecting:

Pretraining data differences
Model architecture influences
Different instruction-tuning strategies

Case Study

Complete Case in Appendix F:

Task: LeetCode-style array manipulation problem
Original Solution: Confused logic with multiple conceptual errors
Feedback: Detailed identification of 5 specific issues, recommending "redraft"
New Solution: Adopts completely different dynamic programming approach, correctly solves problem

Observations:

When feedback quality is high, redraft effectively escapes erroneous methods
New solution demonstrates problem re-understanding
However, such high-quality feedback is not the norm in experiments

Related Work

1. Test-Time Scaling Methods

Execution-Dependent:

Self-Debug (Chen et al., 2023): Iterative debugging using execution feedback
Reflexion (Shinn et al., 2023): Language agents with verbal reinforcement learning
AIDE (Jiang et al., 2025): AI-driven exploration in code space
S* (Li et al., 2025): Test-time search methods

Execution-Independent:

Self-Refine (Madaan et al., 2023): Pure exploitation self-optimization
SETS (Chen et al., 2025): Self-verification and self-correction

2. Exploration-Exploitation Tradeoff

Tang et al. (2024): Modeling LLM code repair as exploration-exploitation tradeoff
This paper's distinction: Focuses on execution-free scenarios, studying intrinsic balancing capability

3. LLM Feedback Capability

Zheng et al. (2024): Reasoning mechanisms in multi-turn code generation
Xie et al. (2025): Teaching LLMs to critique through reinforcement learning
This paper's contribution: Quantifies feedback quality impact on exploration effectiveness

4. Code Generation Evaluation

LiveCodeBench (Jain et al., 2024): Uncontaminated comprehensive evaluation
Pass@k metric (Kulal et al., 2019; Chen et al., 2021)

Conclusions and Discussion

Main Conclusions

SELF-REDRAFT Effective but Limited: Consistently outperforms Self-Refine under same iteration budget, but improvement magnitude is limited (average 0.615%)
Two Major Bottlenecks:
- Insufficient Feedback Generation: Models struggle to identify methodological errors, unable to provide effective redraft guidance
- Fragile Code Discrimination: Misclassification causes harmful redrafts, increased regression rates offset benefits
Model-Specific Nature: Balancing strategies vary dramatically across different LLMs, not a universal capability
Untapped Potential: The gap to the pass@8 upper bound indicates substantial unexploited exploration space

Limitations

Explicitly Stated by Authors:

Execution-Free Paradigm:
- Research scope limited to execution-free scenarios
- Not directly comparable with execution-dependent methods
- Hybrid approaches are future direction
Benchmark Generalization:
- Evaluation only on LiveCodeBench
- Generalization to other programming languages and domains requires verification
Intrinsic Capability Dependence:
- Performance constrained by pretrained model inherent capabilities
- Training-driven improvements (e.g., fine-tuning critique ability) not explored
- Non-intrinsic exploration strategies not investigated

Future Directions

Research Directions Proposed by Paper:

Improve Feedback Generation:
- Train specialized critique models
- Design more effective feedback prompts
- Introduce external knowledge for diagnosis assistance
Enhance Discrimination Ability:
- Improve code correctness judgment reliability
- Reduce harmful redrafts
- May require specialized verifiers
Model-Adaptive Strategies:
- Design customized balancing strategies for different models
- Dynamically adjust exploration-exploitation ratios
- Learn optimal stopping timing
Hybrid Methods:
- Combine execution feedback with intrinsic capabilities
- Optimal strategies with limited test cases

In-Depth Evaluation

Strengths

1. Clear and Important Problem Definition

Focuses on practical scenarios (no test cases)
The exploration-exploitation tradeoff is classical, but its application to code generation is novel
Studying intrinsic capability rather than external tools provides high theoretical value

2. Simple and Effective Method Design

Minimal modifications to Self-Refine, clear comparison
Three-option design (pass/refine/redraft) intuitive and actionable
Structured feedback facilitates analysis

3. Rigorous Experimental Design

Fair Comparison: Uses identical initial solutions
Multi-Model Validation: 6 LLMs of different scales and architectures
Multi-Dimensional Analysis: Performance, feedback quality, discrimination ability, cross-model differences
Blind Evaluation Design: Avoids bias, uses auxiliary models for verification

4. In-Depth and Honest Analysis

Reports not only improvements but honestly points out limitations
Quantifies gap with upper bounds, clarifies improvement space
Identifies specific bottlenecks (feedback, discrimination) rather than vague conclusions
Reveals model-specificity, avoids overgeneralization

5. Strong Reproducibility

Detailed algorithm pseudocode (Algorithm 1)
Complete prompt templates (Appendix A.2)
Clear model configurations and hyperparameters (Appendix C)
Commits to open-sourcing code

Weaknesses

1. Limited Improvement Magnitude

Average 0.615% improvement is modest, and statistical significance is not clearly reported
Some models may fall within noise margins
Requires more experiments to verify stability

2. Limited Evaluation Scope

Only LiveCodeBench benchmark
Untested on other programming languages (beyond Python)
Other code quality dimensions (readability, efficiency) not evaluated

3. Lack of Theoretical Analysis

Why is 0.615% a reasonable expectation?
What is optimal exploration-exploitation ratio?
Missing formal theoretical framework

4. Stopping Condition Design Impact Insufficiently Discussed

Model autonomously deciding when to PASS may introduce bias
Different models' early stopping rates not reported
May affect fairness

5. Absence of Human Evaluation

All evaluations rely on automatic metrics and model judgment
Human perspective on feedback quality and code quality missing
Blind evaluation uses models rather than humans

6. Computational Cost Not Discussed

Actual cost of 16 iterations?
Cost comparison with pass@16?
Practical utility assessment insufficient

Impact

Contribution to Field

Opens New Research Direction: Establishes a baseline for exploration-exploitation balance in execution-free scenarios
Identifies Key Bottlenecks: Clarifies feedback and discrimination as core limitations
Inspires Future Work: Provides clear improvement pathways

Practical Value

Moderate: Current improvements limited but direction clear
Suitable for scenarios where test cases unavailable
Can complement execution-dependent methods

Reproducibility

High: Detailed method description, prompt templates, configurations
Code to be open-sourced
Uses public benchmarks and API-accessible models

Applicable Scenarios

Suitable Scenarios:

Code generation without test cases (early development stages)
Execution environment unavailable or costly
Exploratory programming requiring diverse solutions
Preprocessing step for execution-dependent methods

Unsuitable Scenarios:

Abundant test cases available (execution-dependent methods superior)
Extremely high accuracy requirements for critical code
Severely limited computational budgets (small improvement margins)
Scenarios requiring guaranteed monotonic improvement (regression risk exists)

Key References

Madaan et al. (2023) - Self-Refine: Foundation method of this paper
Jain et al. (2024) - LiveCodeBench: Evaluation benchmark
Tang et al. (2024) - Exploration-exploitation tradeoff in code repair
Xie et al. (2025) - Improving critique ability through RL
Chen et al. (2021) - Codex and pass@k metric
Snell et al. (2024) - Theoretical foundations of test-time compute scaling

Summary

This is a solid empirical research paper addressing an important yet overlooked problem in code generation: exploration-exploitation balance under execution-free scenarios. SELF-REDRAFT is elegantly simple, introducing exploration mechanisms through minimal modifications. While absolute improvements are limited (0.615%), the paper's value lies in:

Honest Scientific Attitude: Does not overstate effects, clearly identifies limitations and gaps
In-Depth Mechanism Analysis: Identifies two bottlenecks—feedback and discrimination
Clear Research Roadmap: Provides explicit directions for future work

The paper's primary contribution is not a powerful new method but a systematic account of where current LLMs fall short in autonomously balancing exploration and exploitation, which is equally important for advancing the field. For researchers, it provides clear improvement targets; for practitioners, it warns of the limitations of current methods.

Recommended future work focus:

Train stronger critique and discrimination capabilities
Explore integration of external knowledge and tools
Research model-adaptive balancing strategies
Validate across more benchmarks and scenarios