Large language models (LLMs) are highly sensitive to their input prompts, which makes prompt design a central challenge. Automatic prompt optimization (APO) reduces manual engineering, but most approaches assume access to ground-truth references such as labeled validation data. In practice, collecting high-quality labels is costly and slow. We propose the Prompt Duel Optimizer (PDO), a sample-efficient framework for label-free prompt optimization. PDO casts prompt selection as a dueling-bandit problem in which the supervision signal comes from pairwise preference feedback provided by an LLM judge. The framework combines Double Thompson Sampling (D-TS), which prioritizes informative prompt comparisons, with Top-Performer Guided Mutation, which expands the candidate pool by mutating high-performing prompts. PDO operates naturally in label-free settings and can also incorporate partial labels to mitigate judge noise. Experiments on BIG-bench Hard (BBH) and MS MARCO show that PDO consistently outperforms baseline methods. Ablation studies further demonstrate the effectiveness of both D-TS and prompt mutation.
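To make the dueling-bandit formulation concrete, the following is a minimal sketch of how pairs of prompts might be chosen for comparison using two rounds of Thompson sampling over a pairwise win-count matrix. It is an illustration under stated assumptions, not the PDO implementation: it omits the confidence-bound pruning used by full D-TS, it does not include Top-Performer Guided Mutation, and the function names (`select_pair`, `run_duels`, `copeland_winner`, `make_simulated_judge`) are hypothetical. The LLM judge is replaced by a simulated judge with hidden prompt strengths.

```python
import random


def select_pair(wins, rng):
    """Pick two distinct prompt indices to duel (requires at least 2 prompts).

    wins[i][j] counts how often prompt i beat prompt j so far.
    Simplified sketch: no confidence-bound pruning as in full D-TS.
    """
    k = len(wins)
    # Round 1: sample a plausible preference matrix from Beta posteriors.
    theta = [[0.5] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            theta[i][j] = rng.betavariate(wins[i][j] + 1, wins[j][i] + 1)
            theta[j][i] = 1.0 - theta[i][j]
    # Copeland score: number of opponents a prompt beats under the sample.
    copeland = [
        sum(theta[i][j] > 0.5 for j in range(k) if j != i) for i in range(k)
    ]
    best = max(copeland)
    first = rng.choice([i for i in range(k) if copeland[i] == best])
    # Round 2: resample only against `first`; pick the strongest challenger.
    challenger = {
        j: rng.betavariate(wins[j][first] + 1, wins[first][j] + 1)
        for j in range(k)
        if j != first
    }
    second = max(challenger, key=challenger.get)
    return first, second


def run_duels(num_prompts, judge, rounds, seed=0):
    """Run `rounds` duels; judge(a, b) returns True if prompt a is preferred."""
    rng = random.Random(seed)
    wins = [[0] * num_prompts for _ in range(num_prompts)]
    for _ in range(rounds):
        a, b = select_pair(wins, rng)
        winner, loser = (a, b) if judge(a, b) else (b, a)
        wins[winner][loser] += 1
    return wins


def copeland_winner(wins):
    """Return the prompt that beats the most opponents by empirical majority."""
    k = len(wins)
    scores = [
        sum(wins[i][j] > wins[j][i] for j in range(k) if j != i)
        for i in range(k)
    ]
    return max(range(k), key=scores.__getitem__)


def make_simulated_judge(strengths, seed=1):
    """Stand-in for an LLM judge (assumption): a noisy Bradley-Terry model."""
    rng = random.Random(seed)

    def judge(a, b):
        p_a_wins = strengths[a] / (strengths[a] + strengths[b])
        return rng.random() < p_a_wins

    return judge


if __name__ == "__main__":
    judge = make_simulated_judge([3.0, 1.0, 1.0, 0.5])
    wins = run_duels(num_prompts=4, judge=judge, rounds=300)
    print("estimated best prompt index:", copeland_winner(wins))
```

In a real setting, `judge` would wrap a call to an LLM that compares the outputs of two prompts on the same input. Because the sampling concentrates duels on prompts that are plausibly the best, fewer judge calls are spent on clearly weak candidates than under uniform pairing.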