SELF-REDRAFT: Eliciting Intrinsic Exploration-Exploitation Balance in Test-Time Scaling for Code Generation
Chen, Zheng, Huang et al.
Test-time scaling without interpreter feedback is essential for real-world code generation scenarios where test cases are not readily available. While existing paradigms often rely on either greedy exploitation (i.e., iterative refinement) or stochastic exploration (i.e., relying on sample-based voting or reranking mechanisms), the balance between these two dimensions remains underexplored. To investigate the LLM's intrinsic ability to balance exploitation and exploration, we introduce SELF-REDRAFT, a framework built upon Self-Refine that encourages the model to propose new drafts for solutions that are fundamentally flawed. Our results show that SELF-REDRAFT consistently achieves better performance than Self-Refine when converged under the same maximum number of iterations. Still, we observe that significant room for improvement remains, largely due to two core aspects of current self-redraft capabilities: constrained capacity for generating instructive feedback and fragile discriminative judgment. We also find that balancing strategies vary notably across different LLMs, reflecting distinct, model-specific behaviors. Overall, our study establishes a baseline for intrinsic exploration-exploitation balancing in test-time scaling and identifies feedback and discrimination as key areas with potential for future advances.
This paper investigates the intrinsic capability of large language models (LLMs) to balance exploration and exploitation in code generation tasks during test-time scaling without execution feedback. Existing approaches either rely on greedy exploitation (iterative refinement) or random exploration (sampling-based voting or reranking), but the balance between them remains insufficiently studied. The authors propose the SELF-REDRAFT framework, which augments Self-Refine with a mechanism to redraft fundamentally flawed solutions. Experiments demonstrate that SELF-REDRAFT consistently outperforms Self-Refine under the same iteration budget, yet significant room for improvement remains, primarily constrained by two core capabilities: insufficient ability to generate directive feedback and fragile code discrimination. The study also reveals substantial differences in balancing strategies across different LLMs, reflecting model-specific behavioral characteristics.
This paper addresses code generation in the execution-free test-time scaling scenario. In practical applications, test cases are often unavailable, requiring LLMs to autonomously improve code quality without program execution feedback.
Practical Necessity: Test cases are frequently missing in real-world scenarios, and execution environments may be unavailable
Computational Efficiency: Test-time scaling is an effective means to enhance LLM performance, but requires maximizing performance within limited computational budgets
Theoretical Value: The exploration-exploitation tradeoff is a fundamental problem in reinforcement learning and search algorithms, yet its application in code generation remains insufficiently studied
The authors aim to investigate LLMs' intrinsic ability to balance exploration and exploitation without execution feedback, identify current model bottlenecks, and provide directions for future improvements.
Proposes SELF-REDRAFT Framework: Builds on Self-Refine by adding an explicit exploration choice that lets the model redraft fundamentally flawed solutions, so that exploration and exploitation can be balanced
Establishes Benchmark Evaluation: Systematically evaluates 6 open-source and proprietary LLMs on LiveCodeBench, showing an average improvement of 0.615% after 16 iterations
Identifies Core Bottlenecks: Through in-depth analysis, reveals two critical limiting factors:
Insufficient Model Critique capability
Fragile Code Discrimination ability
Reveals Model-Specific Behaviors: Discovers substantial differences in balancing strategies across different LLMs, indicating this capability is not universal but rather a model-specific emergent property
Quantifies Improvement Space: By comparing with pass@8 upper bounds, quantifies the gap between current methods and pure exploration potential
Core Difference from Self-Refine: Self-Refine supports only PASS and REFINE, which makes it purely exploitation-oriented. SELF-REDRAFT adds a REDRAFT option that lets the model flag a fundamental error and write a new solution.
Design Rationale:
Errors in code fall into two classes: surface errors (syntax, boundary conditions) and methodological errors (algorithm selection)
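The three-way decision described above can be sketched as a short loop. This is a minimal sketch and not the authors' implementation: the `Action` enum and the `generate`, `critique`, `refine`, and `redraft` callables are hypothetical names introduced for illustration (in practice each would wrap an LLM call), and the default `max_iters=16` simply mirrors the iteration budget quoted in this summary.

```python
from enum import Enum
from typing import Callable, Tuple


class Action(Enum):
    PASS = "pass"        # accept the current solution and stop
    REFINE = "refine"    # exploit: patch the current solution
    REDRAFT = "redraft"  # explore: discard the approach and write a new draft


def self_redraft(
    problem: str,
    generate: Callable[[str], str],
    critique: Callable[[str, str], Tuple[Action, str]],
    refine: Callable[[str, str, str], str],
    redraft: Callable[[str, str, str], str],
    max_iters: int = 16,
) -> str:
    """Iterate critique -> (PASS | REFINE | REDRAFT) without executing any code.

    All callables are hypothetical stand-ins for model calls.
    """
    solution = generate(problem)
    for _ in range(max_iters):
        action, feedback = critique(problem, solution)
        if action is Action.PASS:
            break
        if action is Action.REFINE:
            # Surface error: keep the approach, fix the details.
            solution = refine(problem, solution, feedback)
        else:
            # Methodological error: start over with a different approach.
            solution = redraft(problem, solution, feedback)
    return solution
```

Dropping the `REDRAFT` branch recovers a Self-Refine style loop, which is what makes the comparison under a shared iteration budget straightforward.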
Pass@k: Functional correctness metric
$$\text{pass@}k = \mathbb{E}_{\text{Problem}}\left[1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\right]$$
where n is the number of generated samples and c is the number of correct samples. This paper uses n = 16 and k = 8.
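As a worked illustration of the metric, the per-problem term 1 − C(n−c, k) / C(n, k) can be computed directly with binomial coefficients. The function names below are illustrative and not taken from the paper.

```python
from math import comb
from typing import Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Per-problem pass@k: 1 - C(n - c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the result is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: Sequence[int], n: int, k: int) -> float:
    """Average the per-problem value over problems (the expectation in the formula)."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)
```

With the paper's setting of n = 16 and k = 8, a problem with a single correct sample gives `pass_at_k(16, 1, 8) == 0.5`, since C(15, 8) / C(16, 8) = 6435 / 12870.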
Improvement Rate (r_imp): Proportion of initially incorrect solutions corrected
Regression Rate (r_reg): Proportion of initially correct solutions broken
Recall on Draft: Auxiliary evaluator's recall in correctly identifying "redraft" suggestions
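The improvement and regression rates defined above can be computed from per-problem correctness before and after the iterative procedure. The sketch below assumes each problem has one boolean outcome for the initial solution and one for the final solution; the function name and that data layout are assumptions made for illustration.

```python
from typing import Sequence, Tuple


def improvement_and_regression(
    initial_ok: Sequence[bool], final_ok: Sequence[bool]
) -> Tuple[float, float]:
    """Return (improvement rate, regression rate) as fractions.

    improvement rate = fixed / initially incorrect
    regression rate  = broken / initially correct
    """
    if len(initial_ok) != len(final_ok):
        raise ValueError("both sequences must cover the same problems")
    pairs = list(zip(initial_ok, final_ok))
    fixed = sum(1 for before, after in pairs if not before and after)
    broken = sum(1 for before, after in pairs if before and not after)
    n_wrong = sum(1 for before, _ in pairs if not before)
    n_right = len(pairs) - n_wrong
    # Guard against empty denominators (e.g. every initial solution is correct).
    r_imp = fixed / n_wrong if n_wrong else 0.0
    r_reg = broken / n_right if n_right else 0.0
    return r_imp, r_reg
```

For example, four initially wrong solutions with one fixed and two initially right solutions with one broken yield rates of 0.25 and 0.5.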
Key Finding: Pure exploration (8 independent samples) is more effective than the current exploration-exploitation balance
Gap Examples:
GPT-4.1 mini: SELF-REDRAFT 35.1% vs Pass@8 41.8%
Qwen3-Next: SELF-REDRAFT 48.2% vs Pass@8 55.3%
Interpretation: Many problems can be solved through diverse sampling alone, but SELF-REDRAFT fails to effectively leverage this advantage, indicating inefficient current exploration mechanisms.
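The size of the gap in the two examples above is a simple difference in percentage points. The snippet below only restates the figures quoted in this section; the variable names are illustrative.

```python
# (SELF-REDRAFT accuracy, pass@8), both in percent, as quoted above.
results = {
    "GPT-4.1 mini": (35.1, 41.8),
    "Qwen3-Next": (48.2, 55.3),
}

# Gap to pure exploration, in percentage points (rounded to one decimal).
gaps = {
    model: round(pass_at_8 - self_redraft, 1)
    for model, (self_redraft, pass_at_8) in results.items()
}

for model, gap in gaps.items():
    print(f"{model}: {gap} points below pass@8")
```

Both models sit roughly 7 points below what 8 independent samples could reach.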
Improvement Rate vs. Regression Rate Comparison (Table 1):
| Model | Self-Refine r_imp | SELF-REDRAFT r_imp | Self-Refine r_reg | SELF-REDRAFT r_reg |
|---|---|---|---|---|
| GPT-4.1 mini | 3.29% | 5.18% (+1.89) | 1.11% | 1.27% (+0.16) |
| GPT-4.1 nano | 19.52% | 23.02% (+3.50) | 1.70% | 2.33% (+0.63) |
| Kimi K2 | 9.89% | 12.99% (+3.10) | 1.57% | 2.57% (+1.00) |
| Llama-4-Maverick | 4.15% | 6.74% (+2.59) | 1.68% | 3.78% (+2.10) |
| LongCat-Flash-Chat | 18.68% | 20.33% (+1.65) | 2.69% | 3.01% (+0.32) |
| Qwen3-Next | 26.53% | 29.34% (+2.81) | 0.30% | 0.60% (+0.30) |
Key Findings:
SELF-REDRAFT achieves higher improvement rates (corrects more errors)
But regression rates also rise for every model (more correct solutions are broken)
The increase is largest for Llama-4-Maverick (+2.10 percentage points) and smallest for GPT-4.1 mini (+0.16)
Interpretation: Redrafting is a high-risk operation. Due to limited discrimination ability, models frequently misclassify correct solutions as errors and "break" them, offsetting exploration benefits.
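How far regressions offset improvements depends on how many solutions were correct to begin with. Since the improvement rate is measured over initially incorrect solutions and the regression rate over initially correct ones, the expected change in accuracy is (1 − a) · r_imp − a · r_reg for initial accuracy a. The sketch below applies this to the Llama-4-Maverick SELF-REDRAFT rates from Table 1; the initial accuracy of 0.5 is a hypothetical value chosen for illustration, not a figure from the paper.

```python
def net_accuracy_change(initial_acc: float, r_imp: float, r_reg: float) -> float:
    """Expected accuracy change; all arguments are fractions in [0, 1]."""
    return (1.0 - initial_acc) * r_imp - initial_acc * r_reg


def break_even_accuracy(r_imp: float, r_reg: float) -> float:
    """Initial accuracy above which regressions outweigh improvements."""
    if r_imp + r_reg == 0:
        raise ValueError("at least one rate must be non-zero")
    return r_imp / (r_imp + r_reg)


# SELF-REDRAFT rates for Llama-4-Maverick from Table 1, as fractions.
R_IMP, R_REG = 0.0674, 0.0378
# Hypothetical initial accuracy (an assumption, not reported in this summary).
ASSUMED_INITIAL_ACC = 0.5

delta = net_accuracy_change(ASSUMED_INITIAL_ACC, R_IMP, R_REG)
threshold = break_even_accuracy(R_IMP, R_REG)
```

Under these assumptions the net gain is about 1.5 points, and it would turn negative once initial accuracy exceeds roughly 64%, which illustrates why a higher regression rate matters more for stronger models.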
SELF-REDRAFT Effective but Limited: Consistently outperforms Self-Refine under same iteration budget, but improvement magnitude is limited (average 0.615%)
Two Major Bottlenecks:
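Insufficient Feedback Generation: Models struggle to identify methodological errors and so cannot provide effective redraft guidance
Fragile Code Discrimination: Models frequently misclassify correct solutions as flawed and break them, which offsets the gains from exploration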
This is a solid empirical research paper addressing an important yet overlooked problem in code generation: exploration-exploitation balance under execution-free scenarios. SELF-REDRAFT is elegantly simple, introducing exploration mechanisms through minimal modifications. While absolute improvements are limited (0.615%), the paper's value lies in:
Honest Scientific Attitude: Does not overstate effects, clearly identifies limitations and gaps
In-Depth Mechanism Analysis: Identifies two bottlenecks—feedback and discrimination
Clear Research Roadmap: Provides explicit directions for future work
The paper's primary contribution is not a powerful new method but a systematic account of where current LLMs fall short in autonomously balancing exploration and exploitation, which is equally important for advancing the field. For researchers, it provides clear improvement targets; for practitioners, it is a warning about the limits of current methods.
Recommended future work focus:
Train stronger critique and discrimination capabilities
Explore integration of external knowledge and tools