
Precise Attribute Intensity Control in Large Language Models via Targeted Representation Editing

Zhang, Ye, Heng et al.

Precise attribute intensity control--generating Large Language Model (LLM) outputs with specific, user-defined attribute intensities--is crucial for AI systems adaptable to diverse user expectations. Current LLM alignment methods, however, typically provide only directional or open-ended guidance, failing to reliably achieve exact attribute intensities. We address this limitation with three key designs: (1) reformulating precise attribute intensity control as a target-reaching problem, rather than simple maximization; (2) training a lightweight value function via temporal-difference learning to predict final attribute intensity scores from partial generations, thereby steering LLM outputs; and (3) employing gradient-based interventions on hidden representations to navigate the model precisely towards specific attribute intensity targets. Our method enables fine-grained, continuous control over attribute intensities, moving beyond simple directional alignment. Experiments on LLaMA-3.2-3b and Phi-4-mini confirm our method's ability to steer text generation to user-specified attribute intensities with high accuracy. Finally, we demonstrate efficiency enhancements across three downstream tasks: preference data synthesis, Pareto frontier approximation and optimization, and distillation of aligned behaviors for intervention-free inference. Our code is available on https://github.com/Pre-Control/pre-control



Basic Information

Paper ID: 2510.12121
Title: Precise Attribute Intensity Control in Large Language Models via Targeted Representation Editing
Authors: Rongzhi Zhang, Liqin Ye, Yuzhao Heng, Xiang Chen, Tong Yu, Lingkai Kong, Sudheer Chava, Chao Zhang
Categories: cs.AI cs.CL cs.LG
Publication date/venue: Preprint (Under review)
Paper link: https://arxiv.org/abs/2510.12121

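RLHF and DPO: produce a static model that captures the average of the desired behaviors, so adjusting priorities requires costly retraining
Prompting methods: rely entirely on the model's interpretation of style instructions, which gives inconsistent results
Guided decoding: usually treats attribute intensity as categorical rather than continuous
Multi-objective alignment methods: require extensive training to approximate the global Pareto set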

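Research Motivation

Existing methods cannot control attribute intensity precisely. This paper aims for fine-grained, continuous control over attribute intensity, going beyond simple directional alignment.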

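Core Contributions

Problem reformulation: casts precise attribute intensity control as a target-reaching problem rather than simple maximization/minimization
Value function approach: trains a lightweight value function via temporal-difference learning to predict the final attribute score from a partial generation
Representation editing: applies gradient-based interventions on hidden representations to navigate precisely to a specific attribute intensity target
Efficient applications: shows efficiency gains in Pareto frontier approximation (time complexity reduced from O(m^d) to O(n+k)) and in controllable model distillation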

max_θ E_{x~π_θ}[R(x)]

The paper's target-reaching formulation:

min_θ E_{x~π_θ}[(R̂(x) - τ)²]

where R̂(x) is the reward function normalized to [0, 1] and τ is the user-specified target intensity.
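The difference between the two objectives can be illustrated with a small sketch. The function names and the candidate score lists are hypothetical, chosen only for illustration; the scores stand in for normalized rewards R̂(x) in [0, 1], and averaging over a list stands in for the expectation over sampled outputs.

```python
def maximization_objective(scores):
    """Mean reward: the quantity a directional (maximizing) method pushes up."""
    return sum(scores) / len(scores)


def target_reaching_loss(scores, tau):
    """Mean squared distance between normalized rewards and the target tau."""
    return sum((s - tau) ** 2 for s in scores) / len(scores)


def pick_by_maximization(candidates):
    """Choose the candidate score set with the highest mean reward."""
    return max(candidates, key=maximization_objective)


def pick_by_target(candidates, tau):
    """Choose the candidate score set whose rewards lie closest to tau."""
    return min(candidates, key=lambda c: target_reaching_loss(c, tau))


if __name__ == "__main__":
    # Hypothetical normalized scores for three sets of sampled outputs.
    candidates = [[0.90, 0.95], [0.50, 0.55], [0.10, 0.20]]
    print(pick_by_maximization(candidates))
    print(pick_by_target(candidates, tau=0.5))
```

Under maximization the highest-scoring set always wins, whereas the squared-error objective prefers whichever set sits nearest τ, which is what allows intermediate intensities to be requested.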

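2. Value Function Training

TD(λ) is used to train a value function V_φ(h_t) that predicts the expected attribute intensity of a partial sequence: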

V_φ(h_t) ≈ E_{x>t~π_θ(·|x≤t)}[R̂(x≤t, x>t)]

Generalized return computation:

G^λ_t = (1-λ)∑_{n=1}^{T-t-1} λ^{n-1}V_φ(s_{t+n}) + λ^{T-t-1}r_T

Value function loss:

L_TD = E_{t,s_t}[(V_φ(s_t) - G^λ_t)²]
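The generalized return and the loss above can be computed directly from their definitions. The following is a minimal sketch, assuming `values[t]` holds the current prediction V_φ(s_t) for t = 0..T-1 and `final_reward` is r_T; the function names are illustrative, and in actual training the returns would be treated as fixed regression targets while φ is updated.

```python
def lambda_returns(values, final_reward, lam):
    """Compute G^lambda_t for every step t.

    G_t = (1 - lam) * sum_{n=1}^{T-t-1} lam^(n-1) * V(s_{t+n})
          + lam^(T-t-1) * r_T
    """
    T = len(values)
    returns = []
    for t in range(T):
        horizon = T - t - 1  # number of later states available for bootstrapping
        bootstrap = sum(
            lam ** (n - 1) * values[t + n] for n in range(1, horizon + 1)
        )
        returns.append((1 - lam) * bootstrap + lam ** horizon * final_reward)
    return returns


def td_loss(values, returns):
    """Mean squared error between predictions V(s_t) and targets G_t."""
    return sum((v - g) ** 2 for v, g in zip(values, returns)) / len(values)


if __name__ == "__main__":
    values = [0.2, 0.4, 0.6]  # hypothetical predictions for three prefixes
    targets = lambda_returns(values, final_reward=1.0, lam=0.9)
    print(targets, td_loss(values, targets))
```

Two limiting cases are a useful check: with λ = 0 each target is the next state's prediction (one-step bootstrapping), and with λ = 1 every target equals the final reward r_T (a Monte Carlo target).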

3. Test-Time Intervention

The hidden state is adjusted by gradient descent:

h_t ← h_t - α∇_{h_t}(V_φ(h_t) - τ)²

Multi-attribute case:

h_t ← h_t - α∇_{h_t}∑_{i=1}^m w_i(V^i_φ(h_t) - τ_i)²
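The update rules above can be sketched in a few lines. To keep the example dependency-free it assumes each value function is a linear probe V^i(h) = w_i · h + b_i, so the gradient has a closed form; this is a simplification for illustration only, since the summary describes the value function just as "lightweight", and a real implementation would obtain the gradient by automatic differentiation. All names and the step count are hypothetical.

```python
def value(head, h):
    """Linear value head (assumed form): V(h) = w . h + b."""
    weights, bias = head
    return sum(w * x for w, x in zip(weights, h)) + bias


def edit_hidden_state(h, heads, targets, attr_weights, alpha=0.1, steps=50):
    """Repeat h <- h - alpha * grad_h sum_i a_i * (V_i(h) - tau_i)^2."""
    h = list(h)
    for _ in range(steps):
        grad = [0.0] * len(h)
        for head, tau, a in zip(heads, targets, attr_weights):
            err = value(head, h) - tau
            # For a linear head, d/dh (V(h) - tau)^2 = 2 * (V(h) - tau) * w.
            for j, w in enumerate(head[0]):
                grad[j] += 2.0 * a * err * w
        h = [x - alpha * g for x, g in zip(h, grad)]
    return h


if __name__ == "__main__":
    # Two hypothetical attributes read from different coordinates of h.
    heads = [([1.0, 0.0], 0.0), ([0.0, 1.0], 0.0)]
    edited = edit_hidden_state([0.0, 0.0], heads, targets=[0.7, 0.3],
                               attr_weights=[1.0, 1.0])
    print([value(head, edited) for head in heads])
```

Passing a single head and target recovers the single-attribute update. When attributes conflict (their heads share directions in h), the weights w_i decide how the residual error is split between them.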

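Technical Innovations

Target-oriented design: shifts from directional optimization to precise target reaching
Real-time feedback: the value function provides intermediate feedback during generation
Representation-space navigation: navigates precisely within the high-dimensional representation space itself
Multi-attribute coordination: controls multiple, possibly conflicting, attributes at the same time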

Experimental Setup

Datasets

HelpSteer2: 20,324 training samples and 1,038 test samples, covering 5 attributes (helpfulness, correctness, coherence, complexity, verbosity)
Code-UltraFeedback: 10,000 complex instructions, covering 5 coding-related attributes (complexity & efficiency, style, explanation, instruction-following, readability)