On the Role of Preference Variance in Preference Optimization
Guo, Li, Qiu et al.
Direct Preference Optimization (DPO) has emerged as an important approach for learning from human preferences in aligning large language models (LLMs). However, collecting human preference data is costly and inefficient, motivating methods that reduce the number of annotations required. In this work, we investigate the impact of \emph{preference variance} (PVar), which measures the variance in model preferences when comparing pairs of responses, on the effectiveness of DPO training. We provide a theoretical insight by establishing an upper bound on the DPO gradient norm for any given prompt, showing that the bound is controlled by the PVar of that prompt. This implies that prompts with low PVar can produce only small gradient updates, making them less valuable for learning. We validate this finding by fine-tuning LLMs with preferences generated by a reward model and evaluating on two benchmarks (AlpacaEval 2.0 and Arena-Hard). Experimental results demonstrate that training on prompts with higher PVar outperforms training on randomly selected prompts or on prompts with lower PVar. We also show that our PVar-based selection method remains robust when smaller reward models (1B, 3B) are used for selection. Notably, in a separate experiment using the original human annotations from the UltraFeedback dataset, we found that training on only the top 10\% of prompts with the highest PVar yields better evaluation performance than training on the full dataset, highlighting the importance of preference variance in identifying informative examples for efficient LLM alignment.
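The selection idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact PVar formula, so the code assumes a Bradley-Terry preference probability, sigmoid(r_i - r_j), computed from reward-model scores, and takes PVar to be the variance of that probability over all ordered pairs of sampled responses. The function names `preference_variance` and `select_top_pvar`, and the input layout (a dict mapping each prompt to a list of reward scores), are hypothetical.

```python
import math
from itertools import permutations
from statistics import pvariance


def sigmoid(z):
    # Numerically stable logistic function written via tanh.
    return 0.5 * (1.0 + math.tanh(0.5 * z))


def preference_variance(rewards):
    """Assumed PVar: variance of Bradley-Terry preference probabilities
    sigmoid(r_i - r_j) over all ordered pairs of distinct responses."""
    if len(rewards) < 2:
        return 0.0  # no pair to compare, so no preference signal
    probs = [sigmoid(ri - rj) for ri, rj in permutations(rewards, 2)]
    return pvariance(probs)


def select_top_pvar(prompt_rewards, fraction=0.10):
    """Return the prompts with the highest assumed PVar.

    prompt_rewards: dict mapping a prompt to the reward scores of its
    sampled responses (hypothetical layout, for illustration only).
    """
    ranked = sorted(
        prompt_rewards,
        key=lambda p: preference_variance(prompt_rewards[p]),
        reverse=True,
    )
    k = max(1, math.ceil(fraction * len(ranked)))
    return ranked[:k]


if __name__ == "__main__":
    toy = {
        "prompt_a": [0.0, 0.0, 0.0],  # responses indistinguishable
        "prompt_b": [0.0, 5.0, -3.0],  # strongly separated responses
        "prompt_c": [0.0, 1.0, 0.5],  # mildly separated responses
    }
    print(select_top_pvar(toy, fraction=0.10))
```

Under this assumed definition, a prompt whose responses all receive the same reward has every pairwise preference equal to 0.5 and therefore zero PVar, which matches the intuition that such a prompt contributes little gradient signal.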
academic
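Direct Preference Optimization (DPO) has become an important method for learning from human preferences to align large language models (LLMs). However, collecting human preference data is costly and inefficient, which motivates the search for methods that reduce annotation requirements. This paper studies how preference variance (PVar), which measures the variance in model preferences when comparing pairs of responses, affects the effectiveness of DPO training. It provides a theoretical insight by establishing an upper bound on the DPO gradient norm for any given prompt, showing that the bound is controlled by that prompt's PVar. This implies that low-PVar prompts can produce only small gradient updates, making them less valuable for learning. Experimental results show that prompts with higher PVar outperform randomly selected prompts or prompts with lower PVar. Notably, in an experiment using the original human annotations from the UltraFeedback dataset, training on only the top 10% of prompts with the highest PVar yields better evaluation performance than training on the full dataset.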