Precise Attribute Intensity Control in Large Language Models via Targeted Representation Editing
Zhang, Ye, Heng et al.
Precise attribute intensity control--generating Large Language Model (LLM) outputs with specific, user-defined attribute intensities--is crucial for AI systems adaptable to diverse user expectations. Current LLM alignment methods, however, typically provide only directional or open-ended guidance, failing to reliably achieve exact attribute intensities. We address this limitation with three key designs: (1) reformulating precise attribute intensity control as a target-reaching problem, rather than simple maximization; (2) training a lightweight value function via temporal-difference learning to predict final attribute intensity scores from partial generations, thereby steering LLM outputs; and (3) employing gradient-based interventions on hidden representations to navigate the model precisely towards specific attribute intensity targets. Our method enables fine-grained, continuous control over attribute intensities, moving beyond simple directional alignment. Experiments on LLaMA-3.2-3b and Phi-4-mini confirm our method's ability to steer text generation to user-specified attribute intensities with high accuracy. Finally, we demonstrate efficiency enhancements across three downstream tasks: preference data synthesis, Pareto frontier approximation and optimization, and distillation of aligned behaviors for intervention-free inference. Our code is available on https://github.com/Pre-Control/pre-control
This paper proposes PRE-CONTROL, a method for precisely controlling attribute intensity in large language models (LLMs). It rests on three key designs: (1) reformulating the task as a goal-reaching problem rather than simple maximization; (2) training a lightweight value function via temporal-difference learning to predict final attribute intensity scores from partial generations; and (3) applying gradient-based interventions to hidden representations to steer the model toward specific attribute intensity targets. Experiments show that the method guides text generation to user-specified attribute intensities and improves efficiency in three downstream tasks: preference data synthesis, Pareto frontier approximation, and distillation of aligned behaviors.
Current LLM alignment methods suffer from a critical limitation: they can only provide directional or open-ended guidance without reliably achieving precise attribute intensities. For example, a user might desire an email's formality level to be 3 on a 5-point scale, rather than simply "more formal" or "less formal."
Precise attribute intensity control is essential for building AI systems that adapt to diverse user expectations. It matters most in multi-objective alignment, where attributes conflict with one another and finding a good trade-off requires scalar-level adjustments on continuous scales.
Existing methods lack the capability for precise attribute intensity control. This work aims to achieve fine-grained, continuous attribute intensity control, transcending simple directional alignment.
Problem Reformulation: Reformulates precise attribute intensity control as a goal-reaching problem rather than simple maximization/minimization
Value Function Approach: Trains a lightweight value function via temporal difference learning to predict final attribute scores from partial generations
Representation Editing Technique: Employs gradient-based hidden representation interventions to precisely navigate to specific attribute intensity targets
Efficient Applications: Demonstrates efficiency advantages in Pareto frontier approximation (reducing time complexity from O(m^d) to O(n+k)) and controllable model distillation
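The value-function and representation-editing ideas listed above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: a plain Python list stands in for a hidden representation, a sigmoid over a linear score stands in for the lightweight value function, and the names `value`, `edit_hidden_state`, `w`, `b`, and `step_size` are hypothetical rather than taken from the paper's code. The point is only that the edit moves the representation toward a target score instead of pushing the score as high as possible.

```python
import math


def value(h, w, b):
    """Hypothetical value head: sigmoid of a linear score, output in (0, 1)."""
    z = sum(wi * hi for wi, hi in zip(w, h)) + b
    return 1.0 / (1.0 + math.exp(-z))


def edit_hidden_state(h, w, b, target, step_size=1.0, max_steps=200, tol=1e-3):
    """Gradient descent on h to minimize (value(h) - target) ** 2.

    Stops early once the predicted score is within `tol` of the target.
    """
    h = list(h)  # copy so the caller's representation is not modified
    for _ in range(max_steps):
        v = value(h, w, b)
        err = v - target
        if abs(err) < tol:
            break
        # Chain rule: d/dh (v - target)^2 = 2 * err * v * (1 - v) * w
        scale = 2.0 * err * v * (1.0 - v)
        h = [hi - step_size * scale * wi for hi, wi in zip(h, w)]
    return h


# Assumed toy parameters; a trained value function would supply these.
w = [0.5, -0.25, 1.0]
b = 0.1
h0 = [2.0, 0.0, 1.0]

h_edited = edit_hidden_state(h0, w, b, target=0.6)
print("score before edit:", round(value(h0, w, b), 3))
print("score after edit: ", round(value(h_edited, w, b), 3))
```

In this toy run the initial predicted score is above the target, so the edit lowers it. A maximization objective would never do that, which is the practical difference between directional steering and target-reaching.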
Given a target attribute intensity τ ∈ [0, 1] and a reward function R(x), the objective is to generate text whose attribute intensity score matches the target value, rather than simply maximizing the reward.
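One way to write the contrast between the two objectives is sketched below, using the symbols τ and R(x) from the statement above. The squared distance is an assumption chosen for illustration; the summary only says the score should match the target, and the paper may use a different distance or notation.

```latex
\text{maximization:}\quad
x^{\star} = \arg\max_{x} \; R(x)
\qquad\qquad
\text{target-reaching:}\quad
x^{\star} = \arg\min_{x} \; \bigl( R(x) - \tau \bigr)^{2}
```

Under this reading, a generation that overshoots τ is penalized just as much as one that undershoots it by the same amount.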
Value Function as Reward Model Proxy: Lightweight MLPs may fail to capture all details of the original reward signal
Final-Layer Intervention: Current implementation applies interventions only at the final transformer layer, potentially underutilizing the model's representational hierarchy
Computational Overhead: Although relatively efficient, the method still requires additional value function training and extra inference-time computation
The paper cites 46 relevant references covering key areas including LLM alignment, multi-objective optimization, and representation engineering, providing a solid theoretical foundation for the research.
Overall Assessment: This is a high-quality research paper proposing an innovative method for precise attribute intensity control, with both theoretical and practical value. The methodology is well designed, the experiments are thorough, and the work makes a significant contribution to the field of LLM control.