Large language models (LLMs) are highly sensitive to their input prompts, making prompt design a central challenge. While automatic prompt optimization (APO) reduces manual engineering, most approaches assume access to ground-truth references such as labeled validation data. In practice, however, collecting high-quality labels is costly and slow. We propose the Prompt Duel Optimizer (PDO), a sample-efficient framework for label-free prompt optimization. PDO formulates the problem as a dueling-bandit setting, where supervision signal comes from pairwise preference feedback provided by an LLM judge. The framework combines Double Thompson Sampling (D-TS), which prioritizes informative prompt comparisons, with Top-Performer Guided Mutation, which expands the candidate pool by mutating high-performing prompts. PDO naturally operates in label-free settings and can also incorporate partial labels to mitigate judge noise. Experiments on BIG-bench Hard (BBH) and MS MARCO show that PDO consistently outperforms baseline methods. Ablation studies further demonstrate the effectiveness of both D-TS and prompt mutation.
Large language models (LLMs) are highly sensitive to input prompts, making prompt design a critical challenge. While automatic prompt optimization (APO) reduces manual engineering, most methods assume access to annotated validation data and ground truth labels. However, in practice, collecting high-quality labels is both expensive and time-consuming. This paper proposes the Prompt Duel Optimizer (PDO), a sample-efficient framework for label-free prompt optimization. PDO models the problem as a dueling bandit setting, where supervisory signals come from pairwise preference feedback provided by an LLM judge. The framework combines Dueling Thompson Sampling (D-TS) with top-performer-guided mutation, where the former prioritizes informative prompt comparisons and the latter expands the candidate pool by mutating high-performing prompts. PDO naturally applies to label-free settings and can also incorporate partial labels to mitigate judge noise. Experiments on BIG-bench Hard (BBH) and MS MARCO demonstrate that PDO consistently outperforms baseline methods across various tasks.
The performance of large language models depends significantly on carefully crafted prompts, yet manually creating effective prompts typically requires extensive trial-and-error. Existing automatic prompt optimization (APO) methods, while reducing manual engineering, face the following key challenges:
Label Dependency: Most APO methods rely on annotated validation data to evaluate the performance of candidate prompts
Annotation Cost: In practical applications, obtaining high-quality annotated data is both expensive and time-consuming
Deployment Latency: In industrial scenarios, reasonable prompts must be deployed before large-scale human annotation data becomes available
The core research question is: Can prompts be optimized without reference to ground truth labels?
To address this, the authors propose leveraging LLMs as judges to evaluate prompt quality through pairwise comparisons rather than independent scoring, obtaining more reliable supervisory signals. This approach faces two main challenges:
LLM Judge Noise: LLM judgments exhibit uncertainty, position bias, and verbosity bias
Quadratic Complexity: The number of pairwise comparisons grows quadratically with the number of candidate prompts
Problem Modeling Innovation: First to model preference-based prompt optimization as a dueling bandit problem, using pairwise comparisons from an LLM judge as supervisory signals
Algorithm Framework Design: Proposes the PDO framework, combining Dueling Thompson Sampling (D-TS) for efficient prompt selection and top-performer-guided mutation for search space expansion
Theoretical Guarantees: Provides theoretical analysis of Copeland regret bounds, proving that PDO asymptotically converges to the Copeland-optimal prompt
Experimental Validation: Validates PDO's effectiveness on BBH and MS MARCO datasets, with ablation studies demonstrating the contribution of each component
Flexibility: PDO works in purely label-free settings and can also incorporate partial labels to reduce judge noise
Let X denote the input space and P = {p₁, ..., pₖ} a finite set of candidate prompts. For prompts pᵢ, pⱼ ∈ P and identical input x, binary preferences are obtained via an LLM judge:
D-TS extends Thompson sampling to the dueling bandit setting, using two independent Thompson samples each round to select informative duels:
Per-Round Process:
First Prompt Selection: Compute optimistic Copeland scores, retain the set of prompts with the highest scores, and select a candidate via Thompson sampling
Second Prompt Selection: Restrict to the set of uncertain opponents and select a dueler via Thompson sampling
Duel and Update: Execute judge comparison and update win-loss statistics
Theoretical Foundation: Based on Lipschitz bandit theory, concentrating mutations around top performers is equivalent to "zooming in" the search in the approximate optimal region
Noise Handling: Employs weighted preference matrix updates, down-weighting reasoning-based judgments (which are noisier than answer-based judgments)
Efficiency Optimization: Reduces computational overhead through caching mechanisms and adaptive pruning
Traditional APO methods heavily rely on supervisory signals, while recent research has begun reducing supervision requirements. SPO eliminates external references through output contrast but employs greedy hill-climbing, lacking principled exploration-exploitation balance.
OPTS and TRIPLE model prompt strategy selection as bandit problems but still require annotated validation sets. APOHF connects preference-driven prompt optimization with dueling bandits but assumes manually annotated pairwise preferences.
The paper cites multiple important related works, including:
Zhou et al. (2022) - APE method
Yang et al. (2024) - OPRO method
Fernando et al. (2023) - Breeder method
Wu and Liu (2016) - Dueling Thompson Sampling theory
Zheng et al. (2023) - Related research on LLMs as judges
Overall Assessment: This is an important contribution to the prompt optimization field that effectively addresses the practical need for label-free prompt optimization through innovative problem modeling and theoretical frameworks. The method design is sound, experimental validation is comprehensive, and it possesses strong theoretical foundations and practical value.