LLM Prompt Duel Optimizer: Efficient Label-Free Prompt Optimization

Wu, Verma, Lee et al.
Large language models (LLMs) are highly sensitive to their input prompts, making prompt design a central challenge. While automatic prompt optimization (APO) reduces manual engineering, most approaches assume access to ground-truth references such as labeled validation data. In practice, however, collecting high-quality labels is costly and slow. We propose the Prompt Duel Optimizer (PDO), a sample-efficient framework for label-free prompt optimization. PDO formulates the problem as a dueling-bandit setting, where supervision signal comes from pairwise preference feedback provided by an LLM judge. The framework combines Double Thompson Sampling (D-TS), which prioritizes informative prompt comparisons, with Top-Performer Guided Mutation, which expands the candidate pool by mutating high-performing prompts. PDO naturally operates in label-free settings and can also incorporate partial labels to mitigate judge noise. Experiments on BIG-bench Hard (BBH) and MS MARCO show that PDO consistently outperforms baseline methods. Ablation studies further demonstrate the effectiveness of both D-TS and prompt mutation.

Basic Information

  • Paper ID: 2510.13907
  • Title: LLM Prompt Duel Optimizer: Efficient Label-Free Prompt Optimization
  • Authors: Yuanchen Wu, Saurabh Verma, Justin Lee, Fangzhou Xiong, Poppy Zhang, Amel Awadelkarim, Xu Chen, Yubai Yuan, Shawndra Hill
  • Categories: cs.CL (Computation and Language), stat.ML (Machine Learning)
  • Publication Date: October 14, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2510.13907

Abstract

Large language models (LLMs) are highly sensitive to input prompts, making prompt design a critical challenge. While automatic prompt optimization (APO) reduces manual engineering, most methods assume access to annotated validation data with ground-truth labels. In practice, however, collecting high-quality labels is both expensive and time-consuming. This paper proposes the Prompt Duel Optimizer (PDO), a sample-efficient framework for label-free prompt optimization. PDO models the problem as a dueling bandit setting in which the supervisory signal comes from pairwise preference feedback provided by an LLM judge. The framework combines Double Thompson Sampling (D-TS), which prioritizes informative prompt comparisons, with top-performer-guided mutation, which expands the candidate pool by mutating high-performing prompts. PDO applies naturally to label-free settings and can also incorporate partial labels to mitigate judge noise. Experiments on BIG-bench Hard (BBH) and MS MARCO show that PDO consistently outperforms baseline methods across tasks.

Research Background and Motivation

Problem Definition

The performance of large language models depends significantly on carefully crafted prompts, yet manually creating effective prompts typically requires extensive trial-and-error. Existing automatic prompt optimization (APO) methods, while reducing manual engineering, face the following key challenges:

  1. Label Dependency: Most APO methods rely on annotated validation data to evaluate the performance of candidate prompts
  2. Annotation Cost: In practical applications, obtaining high-quality annotated data is both expensive and time-consuming
  3. Deployment Latency: In industrial scenarios, reasonable prompts must be deployed before large-scale human annotation data becomes available

Research Motivation

The core research question is: Can prompts be optimized without reference to ground truth labels?

To address this, the authors propose leveraging LLMs as judges to evaluate prompt quality through pairwise comparisons rather than independent scoring, obtaining more reliable supervisory signals. This approach faces two main challenges:

  1. LLM Judge Noise: LLM judgments exhibit uncertainty, position bias, and verbosity bias
  2. Quadratic Complexity: The number of pairwise comparisons grows quadratically with the number of candidate prompts

Core Contributions

  1. Problem Modeling Innovation: First to model preference-based prompt optimization as a dueling bandit problem, using pairwise comparisons from an LLM judge as supervisory signals
  2. Algorithm Framework Design: Proposes the PDO framework, combining Double Thompson Sampling (D-TS) for efficient prompt selection and top-performer-guided mutation for search space expansion
  3. Theoretical Guarantees: Provides theoretical analysis of Copeland regret bounds, proving that PDO asymptotically converges to the Copeland-optimal prompt
  4. Experimental Validation: Validates PDO's effectiveness on BBH and MS MARCO datasets, with ablation studies demonstrating the contribution of each component
  5. Flexibility: PDO works in purely label-free settings and can also incorporate partial labels to reduce judge noise
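
For reference, the Copeland quantities named in the third contribution are usually defined as follows in the dueling bandit literature. The notation is assumed here for illustration and may differ from the paper's: K is the number of candidate prompts, \mu_{ij} is the probability that prompt p_i is preferred over p_j, \zeta_i is the normalized Copeland score, and a_t^{(1)}, a_t^{(2)} are the two prompts dueled in round t.

```latex
\begin{align}
\zeta_i &= \frac{1}{K-1} \sum_{j \neq i} \mathbb{1}\!\left[\mu_{ij} > \tfrac{1}{2}\right],
  & \zeta^{*} &= \max_{i} \zeta_i \\
r_t &= \zeta^{*} - \tfrac{1}{2}\left(\zeta_{a_t^{(1)}} + \zeta_{a_t^{(2)}}\right),
  & R_T &= \sum_{t=1}^{T} r_t
\end{align}
```

Under this normalization a Condorcet winner is the special case \zeta_i = 1, and the per-round regret is zero only when both dueled prompts are Copeland winners.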

Methodology Details

Task Definition

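Let X denote the input space and P = {p₁, ..., pₖ} a finite set of candidate prompts, and let f_p(x) denote the LLM's output on input x under prompt p. For prompts pᵢ, pⱼ ∈ P and the same input x, a binary preference is obtained from an LLM judge:

\[
\mathrm{Judge}_x(p_i, p_j) =
\begin{cases}
1, & \text{if } f_{p_i}(x) \succ f_{p_j}(x) \\
0, & \text{otherwise}
\end{cases}
\]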

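The goal is to identify the Condorcet winner (a prompt preferred over every other prompt) if one exists, or otherwise a Copeland winner (a prompt that beats the largest number of other prompts), within a limited comparison budget.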
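
The two winner notions can be made concrete with a short sketch. It assumes a preference matrix `pref` is already available, where `pref[i, j]` estimates the probability that prompt i is preferred over prompt j; the function names and the example matrices are illustrative and do not come from the paper.

```python
import numpy as np

def copeland_scores(pref: np.ndarray) -> np.ndarray:
    """Normalized Copeland score: fraction of opponents each prompt beats.

    pref[i, j] is the (estimated) probability that prompt i is preferred
    over prompt j; the diagonal is ignored.
    """
    k = pref.shape[0]
    beats = pref > 0.5
    np.fill_diagonal(beats, False)
    return beats.sum(axis=1) / (k - 1)

def condorcet_winner(pref: np.ndarray):
    """Index of the prompt that beats every other prompt, or None."""
    scores = copeland_scores(pref)
    best = int(np.argmax(scores))
    return best if scores[best] == 1.0 else None

def copeland_winners(pref: np.ndarray):
    """Indices of all prompts with the maximal Copeland score."""
    scores = copeland_scores(pref)
    return [int(i) for i in np.flatnonzero(scores == scores.max())]

# Transitive preferences: prompt 0 beats everyone.
transitive = np.array([[0.5, 0.7, 0.8],
                       [0.3, 0.5, 0.6],
                       [0.2, 0.4, 0.5]])
# Cyclic preferences (0 beats 1, 1 beats 2, 2 beats 0): no Condorcet winner.
cyclic = np.array([[0.5, 0.7, 0.3],
                   [0.3, 0.5, 0.7],
                   [0.7, 0.3, 0.5]])
```

The cyclic case shows why the Copeland winner is the fallback target: a Condorcet winner need not exist, but at least one Copeland winner always does.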

Model Architecture

1. Double Thompson Sampling (D-TS)

D-TS extends Thompson sampling to the dueling bandit setting, using two independent Thompson samples each round to select informative duels:

Per-Round Process:

  1. First Prompt Selection: Compute optimistic Copeland scores, retain the set of prompts with the highest scores, and select a candidate via Thompson sampling
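  2. Second Prompt Selection: Restrict to opponents whose outcome against the first prompt is still uncertain (their lower confidence bound on beating it does not exceed 1/2) and select a dueler via a second, independent Thompson sample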
  3. Duel and Update: Execute judge comparison and update win-loss statistics
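
The per-round process can be sketched as follows. This is a minimal illustration modeled on the published Double Thompson Sampling algorithm (Wu and Liu, 2016), not the authors' code: the win-count matrix `wins`, the confidence-bound form, the tie-breaking, and the synthetic Bernoulli judge in `run_dts` are all assumptions made for the example.

```python
import numpy as np

def dts_select(wins: np.ndarray, t: int, alpha: float, rng: np.random.Generator):
    """Pick one duel (first, second) from the win-count matrix wins[i, j]."""
    k = wins.shape[0]
    total = wins + wins.T
    with np.errstate(divide="ignore", invalid="ignore"):
        mean = np.where(total > 0, wins / total, 0.5)
        # Unplayed pairs get a bonus of 1 so their bounds are uninformative.
        bonus = np.where(total > 0, np.sqrt(alpha * np.log(t) / total), 1.0)
    upper, lower = mean + bonus, mean - bonus
    np.fill_diagonal(upper, 0.5)
    np.fill_diagonal(lower, 0.5)
    off_diag = ~np.eye(k, dtype=bool)

    # Step 1a: optimistic Copeland scores; keep only the top-scoring prompts.
    optimistic = ((upper > 0.5) & off_diag).sum(axis=1)
    candidates = np.flatnonzero(optimistic == optimistic.max())

    # Step 1b: first Thompson sample, theta[i, j] ~ Beta(wins_ij + 1, wins_ji + 1)
    # for i < j, mirrored so that theta[j, i] = 1 - theta[i, j].
    theta = np.triu(rng.beta(wins + 1, wins.T + 1), 1)
    theta = theta + np.tril(1.0 - theta.T, -1)
    sampled = ((theta > 0.5) & off_diag).sum(axis=1)
    best = candidates[sampled[candidates] == sampled[candidates].max()]
    first = int(rng.choice(best))  # random tie-breaking

    # Step 2: second, independent sample against `first`, restricted to
    # opponents whose lower bound on beating `first` is at most 1/2.
    theta2 = rng.beta(wins[:, first] + 1, wins[first, :] + 1)
    theta2[first] = 0.5
    theta2[lower[:, first] > 0.5] = -np.inf
    second = int(np.argmax(theta2))
    return first, second

def run_dts(true_pref: np.ndarray, rounds: int, alpha: float = 1.2, seed: int = 0):
    """Simulate D-TS with a synthetic judge standing in for the LLM judge."""
    rng = np.random.default_rng(seed)
    k = true_pref.shape[0]
    wins = np.zeros((k, k))
    for t in range(1, rounds + 1):
        i, j = dts_select(wins, t, alpha, rng)
        if i == j:
            continue  # a self-duel carries no information
        # Step 3: duel and update the win-loss statistics.
        if rng.random() < true_pref[i, j]:
            wins[i, j] += 1
        else:
            wins[j, i] += 1
    return wins

true_pref = np.array([[0.5, 0.8, 0.9],
                      [0.2, 0.5, 0.7],
                      [0.1, 0.3, 0.5]])
wins = run_dts(true_pref, rounds=200)
```

In a real run the Bernoulli draw would be replaced by a call to the LLM judge on a sampled input, and `wins` would be inspected to see which pairs received the most comparisons.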

2. Top-Performer-Guided Mutation

To expand the search space, PDO periodically mutates the best-performing prompts:

Mutation Process:

  1. Selection: Select the prompt with the highest current Copeland score
  2. Mutation: Generate variants through template editing, text gradient guidance, or LLM-assisted rewriting
  3. Expansion: Add new variants to the candidate pool
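
A minimal sketch of one mutation step, under stated assumptions: the empirical Copeland score is computed from raw win counts, new prompts enter the pool with no recorded duels, and `mutate` is a hypothetical callable standing in for whatever rewriting strategy is used (template editing, text gradients, or an LLM rewrite). None of these names come from the paper.

```python
from typing import Callable, List, Tuple
import numpy as np

def expand_pool(
    prompts: List[str],
    wins: np.ndarray,
    mutate: Callable[[str, int], List[str]],
    n_variants: int = 2,
) -> Tuple[List[str], np.ndarray, int]:
    """Mutate the current top performer and grow the pool and win matrix."""
    k = len(prompts)
    total = wins + wins.T
    # Empirical win rates; unplayed pairs are treated as ties (0.5).
    rate = np.divide(wins, total, out=np.full((k, k), 0.5), where=total > 0)
    beats = rate > 0.5
    np.fill_diagonal(beats, False)
    top = int(np.argmax(beats.sum(axis=1)))  # highest empirical Copeland score

    variants = mutate(prompts[top], n_variants)
    new_prompts = prompts + variants
    # Existing statistics are kept; new prompts start with zero duels.
    new_wins = np.zeros((len(new_prompts), len(new_prompts)))
    new_wins[:k, :k] = wins
    return new_prompts, new_wins, top

def toy_mutate(prompt: str, n: int) -> List[str]:
    """Placeholder mutation; a real system would call an LLM here."""
    return [f"{prompt} (variant {i + 1})" for i in range(n)]

prompts = ["A", "B", "C"]
wins = np.array([[0.0, 1.0, 0.0],
                 [3.0, 0.0, 4.0],
                 [1.0, 0.0, 0.0]])
new_prompts, new_wins, top = expand_pool(prompts, wins, toy_mutate)
```

Because the variants start with empty statistics, a sampler such as D-TS will treat them as highly uncertain and tend to schedule them for early comparisons.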

Technical Innovations

  1. Theoretical Foundation: Drawing on Lipschitz bandit theory, the authors argue that concentrating mutations around top performers amounts to "zooming in" on the approximately optimal region of the prompt space
  2. Noise Handling: Employs weighted preference matrix updates, down-weighting reasoning-based judgments (which are noisier than answer-based judgments)
  3. Efficiency Optimization: Reduces computational overhead through caching mechanisms and adaptive pruning
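
The weighted update in item 2 can be illustrated with a few lines. The weight values below are hypothetical: the summary states only that reasoning-based judgments are down-weighted relative to answer-based ones, not by how much.

```python
import numpy as np

# Hypothetical weights; the source does not give the actual values.
JUDGMENT_WEIGHTS = {"answer": 1.0, "reasoning": 0.5}

def weighted_update(wins: np.ndarray, winner: int, loser: int, kind: str) -> np.ndarray:
    """Credit `winner` with a fractional win according to the judgment type."""
    wins[winner, loser] += JUDGMENT_WEIGHTS[kind]
    return wins

wins = np.zeros((2, 2))
weighted_update(wins, 0, 1, "answer")     # prompt 0 wins on the final answer
weighted_update(wins, 1, 0, "reasoning")  # prompt 1 wins on reasoning quality
weighted_update(wins, 0, 1, "reasoning")  # prompt 0 wins on reasoning quality
```

Fractional counts remain compatible with Thompson sampling, since a Beta posterior is defined for any positive real parameters.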

Experimental Setup

Datasets

  1. BIG-bench Hard (BBH): Selects 16 multiple-choice reasoning tasks, using accuracy as the evaluation metric
  2. MS MARCO: Four open-ended QA task categories (descriptive, entity, numerical, location), using 1-5 scale LLM ratings

Evaluation Metrics

  • BBH tasks: Accuracy
  • MS MARCO tasks: Integer ratings on a 1-5 scale provided by LLM judge

Baseline Methods

Label-Free Baselines:

  • SPO (Self-Supervised Prompt Optimization)
  • CoT (Chain-of-Thought)
  • PoS (Plan-and-Solve)

Supervised Baselines:

  • APE (Automatic Prompt Engineer)
  • OPRO (Optimization by PROmpting)
  • Breeder (Prompt Evolution)

Implementation Details

  • BBH: 20 initial candidate prompts, 30 rounds, 50 duels per round
  • MS MARCO: 50 initial candidate prompts, 30 rounds, 50 duels per round
  • Uses Llama-3.3-70B-Instruct as the generation, judge, and evaluation model
  • D-TS parameter α = 1.2
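
To put these settings in perspective, a quick worked calculation follows. The comparison with exhaustive pairwise evaluation is an illustration added here, not a figure from the paper; it assumes the 50 duels of each round are counted individually and ignores the prompts that mutation adds to the pool.

```latex
\begin{align}
\text{duels per run} &= 30 \times 50 = 1500 \\
\binom{20}{2} &= \frac{20 \cdot 19}{2} = 190, \qquad
\binom{50}{2} = \frac{50 \cdot 49}{2} = 1225
\end{align}
```

A single exhaustive pass over the initial pool would thus take 190 judge calls per input on BBH and 1225 on MS MARCO, before any repetition to average out judge noise, which is why an adaptive allocation of the duel budget matters.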

Experimental Results

Main Results

BBH Task Performance (Label-Free Setting)

PDO achieves the best performance on 13 out of 16 tasks, with notable improvements including:

  • Tracking-7: 0.641 vs 0.543 (+9.8 percentage points)
  • Web of Lies: 0.942 vs 0.861 (+8.1 percentage points)

MS MARCO Task Performance

On all 4 tasks, PDO with D-TS consistently outperforms RUCB and random sampling, surpassing the SPO baseline within several rounds.

Ablation Studies

  1. D-TS vs Other Sampling Strategies: D-TS significantly outperforms random sampling and RUCB in sample efficiency
  2. Mutation Effects: Top-performer-guided mutation significantly improves performance on Web of Lies and Tracking-7 tasks
  3. Pairwise Preferences vs Pointwise Ratings: Pairwise preferences outperform pointwise ratings in 7 out of 8 model-task combinations

LLM Judge Analysis

  1. Task-Related Noise Levels: Judge reliability varies significantly across tasks; for example, Geometric tasks exhibit larger judgment errors
  2. Role of Partial Labels: Incorporating 30%-50% ground truth labels significantly reduces judgment noise
  3. Model Scale Impact: 70B and 8B models as judges show comparable overall performance
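
One plausible way to combine partial labels with the judge is sketched below. This is an assumption made for illustration, not the paper's stated procedure: when a gold label exists and exactly one output matches it, the label decides the duel; otherwise the decision falls back to the LLM judge. The exact-match check suits multiple-choice answers only, and `llm_judge` is a hypothetical callable.

```python
from typing import Callable, Dict, Hashable

def judge_with_partial_labels(
    x: Hashable,
    out_i: str,
    out_j: str,
    labels: Dict[Hashable, str],
    llm_judge: Callable[[Hashable, str, str], int],
) -> int:
    """Return 1 if output i is preferred over output j, else 0."""
    gold = labels.get(x)
    if gold is not None:
        correct_i, correct_j = out_i == gold, out_j == gold
        if correct_i != correct_j:
            # The label separates the two outputs, so no judge call is needed.
            return 1 if correct_i else 0
    # No label, or both outputs agree with respect to it: defer to the judge.
    return llm_judge(x, out_i, out_j)

def always_first(x, a, b) -> int:
    """Stub judge for the example; always prefers the first output."""
    return 1

labels = {"q1": "B"}  # only some inputs carry gold labels
```

With this rule, label-decided duels are free of judge noise, and the remaining duels behave exactly as in the label-free setting.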

Related Work

Evolution of APO Methods

Traditional APO methods heavily rely on supervisory signals, while recent research has begun reducing supervision requirements. SPO eliminates external references through output contrast but employs greedy hill-climbing, lacking principled exploration-exploitation balance.

Application of Bandits to Prompt Optimization

OPTS and TRIPLE model prompt strategy selection as bandit problems but still require annotated validation sets. APOHF connects preference-driven prompt optimization with dueling bandits but assumes manually annotated pairwise preferences.

Conclusions and Discussion

Main Conclusions

  1. PDO addresses label-free prompt optimization through a dueling bandit framework that achieves sample-efficient search
  2. D-TS identifies high-quality prompts faster and more reliably than random sampling and other dueling bandit methods
  3. Top-performer-guided mutation effectively directs search toward stronger regions
  4. Pairwise preferences provide more stable supervisory signals than pointwise ratings

Limitations

  1. Judge Dependency: Optimization quality depends on the LLM judge's capabilities and meta-prompt design
  2. Style Preference Risk: The algorithm may favor stylistic patterns preferred by the judge rather than true task metrics
  3. Computational Resource Constraints: Due to resource limitations, extensive experiments across more models were not conducted

Future Directions

  1. Improve alignment between LLM judges and task objectives
  2. Develop adaptive adjustment mechanisms reflecting judgment reliability
  3. Explore more sophisticated uncertainty capture mechanisms

In-Depth Evaluation

Strengths

  1. Problem Modeling Innovation: Modeling prompt optimization as a dueling bandit problem has both theoretical foundation and practical value
  2. Method Completeness: Combines efficient selection strategies with search space expansion, forming a comprehensive optimization framework
  3. Comprehensive Experiments: Thorough evaluation across multiple datasets, including ablation studies and judge analysis
  4. Theoretical Guarantees: Provides theoretical analysis of Copeland regret bounds

Weaknesses

  1. Judge Noise Handling: Although the paper analyzes judge noise, the proposed remedies are relatively simple
  2. Scalability: Performance on large-scale candidate prompt sets is insufficiently validated
  3. Task Generalization: Primarily validated on reasoning and QA tasks; applicability to other task types remains unclear

Impact

  1. Academic Contribution: Provides a novel theoretical framework and practical method for label-free prompt optimization
  2. Practical Value: Has direct application value in industrial scenarios, particularly when annotated data is scarce
  3. Reproducibility: Authors commit to open-sourcing code, facilitating method reproduction and further research

Applicable Scenarios

  1. Scarce Annotation Data: New domains or tasks lacking abundant annotated data
  2. Rapid Deployment Requirements: Industrial applications requiring reasonable prompts in short timeframes
  3. Cost-Sensitive Applications: Scenarios where annotation costs are prohibitive
  4. Multi-Task Optimization: Simultaneously optimizing prompts for multiple related tasks

References

The paper cites multiple important related works, including:

  • Zhou et al. (2022) - APE method
  • Yang et al. (2024) - OPRO method
  • Fernando et al. (2023) - Breeder method
  • Wu and Liu (2016) - Double Thompson Sampling theory
  • Zheng et al. (2023) - Related research on LLMs as judges

Overall Assessment: This is an important contribution to the prompt optimization field that effectively addresses the practical need for label-free prompt optimization through innovative problem modeling and theoretical frameworks. The method design is sound, experimental validation is comprehensive, and it possesses strong theoretical foundations and practical value.