Precise Attribute Intensity Control in Large Language Models via Targeted Representation Editing

Zhang, Ye, Heng et al.
Precise attribute intensity control--generating Large Language Model (LLM) outputs with specific, user-defined attribute intensities--is crucial for AI systems adaptable to diverse user expectations. Current LLM alignment methods, however, typically provide only directional or open-ended guidance, failing to reliably achieve exact attribute intensities. We address this limitation with three key designs: (1) reformulating precise attribute intensity control as a target-reaching problem, rather than simple maximization; (2) training a lightweight value function via temporal-difference learning to predict final attribute intensity scores from partial generations, thereby steering LLM outputs; and (3) employing gradient-based interventions on hidden representations to navigate the model precisely towards specific attribute intensity targets. Our method enables fine-grained, continuous control over attribute intensities, moving beyond simple directional alignment. Experiments on LLaMA-3.2-3b and Phi-4-mini confirm our method's ability to steer text generation to user-specified attribute intensities with high accuracy. Finally, we demonstrate efficiency enhancements across three downstream tasks: preference data synthesis, Pareto frontier approximation and optimization, and distillation of aligned behaviors for intervention-free inference. Our code is available on https://github.com/Pre-Control/pre-control

Basic Information

  • Paper ID: 2510.12121
  • Title: Precise Attribute Intensity Control in Large Language Models via Targeted Representation Editing
  • Authors: Rongzhi Zhang, Liqin Ye, Yuzhao Heng, Xiang Chen, Tong Yu, Lingkai Kong, Sudheer Chava, Chao Zhang
  • Classification: cs.AI cs.CL cs.LG
  • Publication Date/Venue: Preprint (Under review)
  • Paper Link: https://arxiv.org/abs/2510.12121

Abstract

This paper proposes PRE-CONTROL, a method for precise control of attribute intensity in large language models (LLMs). The method achieves precise attribute intensity control through three key designs: (1) reformulating precise attribute intensity control as a goal-reaching problem rather than simple maximization; (2) training a lightweight value function via temporal difference learning to predict final attribute intensity scores from partial generations; (3) employing gradient-based interventions on hidden representations to precisely navigate the model toward specific attribute intensity targets. Experiments demonstrate that the method can guide text generation to user-specified attribute intensities and exhibits efficiency improvements in downstream tasks including preference data synthesis, Pareto frontier approximation, and aligned behavior distillation.

Research Background and Motivation

Problem Definition

Current LLM alignment methods suffer from a critical limitation: they can only provide directional or open-ended guidance without reliably achieving precise attribute intensities. For example, a user might desire an email's formality level to be 3 on a 5-point scale, rather than simply "more formal" or "less formal."

Problem Importance

Precise attribute intensity control is essential for building AI systems that adapt to diverse user expectations. It matters most in multi-objective alignment, where attributes conflict and finding a good trade-off requires scalar-level adjustments on continuous scales.

Limitations of Existing Methods

  1. RLHF and DPO: Produce static models capturing average desired behavior, requiring expensive retraining to adjust priorities
  2. Prompting methods: Entirely dependent on model interpretation of style instructions, yielding inconsistent results
  3. Guided decoding: Typically treats attribute intensity as categorical rather than continuous
  4. Multi-objective alignment methods: Require substantial training to approximate global Pareto sets

Research Motivation

Existing methods lack the capability for precise attribute intensity control. This work aims to achieve fine-grained, continuous attribute intensity control, transcending simple directional alignment.

Core Contributions

  1. Problem Reformulation: Reformulates precise attribute intensity control as a goal-reaching problem rather than simple maximization/minimization
  2. Value Function Approach: Trains a lightweight value function via temporal difference learning to predict final attribute scores from partial generations
  3. Representation Editing Technique: Employs gradient-based hidden representation interventions to precisely navigate to specific attribute intensity targets
  4. Efficient Applications: Demonstrates efficiency advantages in Pareto frontier approximation (reducing time complexity from O(m^d) to O(n+k)) and controllable model distillation

Methodology Details

Task Definition

Given a target attribute intensity τ ∈ [0,1] and a reward function R(x), the objective is to generate text whose attribute intensity score matches the target value, rather than simply maximizing the reward.

Model Architecture

1. Goal-Reaching Problem Reformulation

Traditional alignment objective:

max_θ E_{x~π_θ}[R(x)]

Goal-reaching formulation in this work:

min_θ E_{x~π_θ}[(R̂(x) - τ)²]

where R̂(x) is the reward function normalized to [0,1].
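
The difference between the two objectives can be illustrated with a small sketch. It is not the paper's training code: the min-max `normalize` helper, the 0 to 4 score range, and the candidate scores are assumptions chosen only to show that minimizing (R̂(x) - τ)² selects a different output than maximizing R̂(x).

```python
def normalize(score, lo, hi):
    """Min-max normalize a raw score into [0, 1] (assumed normalization)."""
    return (score - lo) / (hi - lo)


def pick_by_maximization(scores):
    """Index of the candidate with the highest normalized reward."""
    return max(range(len(scores)), key=lambda i: scores[i])


def pick_by_target(scores, tau):
    """Index of the candidate minimizing the squared gap to the target tau."""
    return min(range(len(scores)), key=lambda i: (scores[i] - tau) ** 2)


# Hypothetical raw attribute scores of four candidate generations (0-4 scale).
raw_scores = [1.0, 2.0, 3.0, 4.0]
r_hat = [normalize(s, 0.0, 4.0) for s in raw_scores]

# The user asks for intensity 2 out of 4, i.e. tau = 0.5 after normalization.
tau = normalize(2.0, 0.0, 4.0)

best_max = pick_by_maximization(r_hat)    # favors the most intense candidate
best_target = pick_by_target(r_hat, tau)  # favors the candidate closest to tau
```

Under maximization the most intense candidate always wins, whereas the goal-reaching loss prefers the candidate whose score sits at the requested level.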

2. Value Function Training

A value function V_φ(h_t) is trained with TD(λ) to predict the expected final attribute intensity of a partial sequence:

V_φ(h_t) ≈ E_{x_{>t} ~ π_θ(·|x_{≤t})}[R̂(x_{≤t}, x_{>t})]

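Generalized λ-return used as the regression target, where s_t denotes the state at step t (corresponding to the hidden representation h_t above) and r_T is the normalized reward of the completed sequence: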

G^λ_t = (1-λ)∑_{n=1}^{T-t-1} λ^{n-1}V_φ(s_{t+n}) + λ^{T-t-1}r_T

Value function loss:

L_TD = E_{t,s_t}[(V_φ(s_t) - G^λ_t)²]
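
As a hedged illustration of the two formulas above, the following sketch computes λ-returns and the squared TD loss for one trajectory. It assumes zero intermediate rewards, no discounting, and 0-based indexing with `values[t]` standing in for V_φ(s_t); the numbers are made up, and in practice the values would come from the trained network.

```python
def lambda_returns(values, r_T, lam):
    """Compute G^lambda_t for t = 0..T-1 following the formula above.

    values[t] plays the role of V_phi(s_t); r_T is the terminal reward.
    """
    T = len(values)
    returns = []
    for t in range(T):
        horizon = T - t - 1
        # (1 - lam) * sum_{n=1}^{T-t-1} lam^(n-1) * V(s_{t+n})
        bootstrap = sum(lam ** (n - 1) * values[t + n] for n in range(1, horizon + 1))
        returns.append((1.0 - lam) * bootstrap + lam ** horizon * r_T)
    return returns


def td_loss(values, returns):
    """Mean squared error between value predictions and lambda-returns."""
    return sum((v - g) ** 2 for v, g in zip(values, returns)) / len(values)


# Hypothetical value predictions along a 4-step partial generation.
values = [0.40, 0.55, 0.50, 0.70]
r_T = 0.75  # hypothetical normalized reward of the finished sequence

targets = lambda_returns(values, r_T, lam=0.9)
loss = td_loss(values, targets)
```

With λ = 1 every target collapses to the terminal reward r_T (a Monte Carlo target), while λ = 0 bootstraps only from the next state's value.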

3. Test-Time Intervention

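At inference time, hidden states are adjusted by gradient descent on the squared gap between the predicted intensity and the target, with step size α: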

h_t ← h_t - α∇_{h_t}(V_φ(h_t) - τ)²

Multi-attribute case, with one value function V^i_φ, target τ_i, and weight w_i for each attribute i = 1, ..., m:

h_t ← h_t - α∇_{h_t}∑_{i=1}^m w_i(V^i_φ(h_t) - τ_i)²
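
The update rules above can be sketched in a few lines. This is a toy illustration, not the authors' implementation: the value heads here are assumed to be sigmoid-of-linear functions so that the gradient can be written by hand, and plain gradient descent with a loss threshold stands in for the Adam optimizer with early stopping listed under Implementation Details. All weights, hidden states, and targets are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def value(h, head):
    """Toy value head: sigmoid(w . h + b), so predictions lie in (0, 1)."""
    w, b = head
    return sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b)


def intervene(h, heads, targets, weights, alpha=1.0, max_steps=500, tol=1e-10):
    """Gradient descent on sum_i w_i * (V_i(h) - tau_i)^2 with respect to h."""
    h = list(h)
    for _ in range(max_steps):
        preds = [value(h, head) for head in heads]
        loss = sum(c * (v - tau) ** 2 for c, v, tau in zip(weights, preds, targets))
        if loss < tol:  # stop early once the targets are (almost) reached
            break
        grad = [0.0] * len(h)
        for c, v, tau, (w, _) in zip(weights, preds, targets, heads):
            # d/dh of c * (sigmoid(z) - tau)^2 = 2c (v - tau) v (1 - v) w
            scale = 2.0 * c * (v - tau) * v * (1.0 - v)
            for j, wj in enumerate(w):
                grad[j] += scale * wj
        h = [hj - alpha * gj for hj, gj in zip(h, grad)]
    return h


# Single attribute: steer a hypothetical hidden state toward tau = 0.6.
single_head = ([0.5, -0.3, 0.8], 0.0)
h_single = intervene([1.0, 2.0, -1.0], [single_head], [0.6], [1.0])

# Two attributes with different targets and equal weights.
heads = [([1.0, 0.5, 0.0], 0.0), ([0.2, 1.0, 0.5], 0.0)]
h_multi = intervene([0.0, 0.0, 0.0], heads, [0.7, 0.3], [1.0, 1.0])
```

Because the loss is a squared distance to τ rather than the value itself, the update moves the hidden state up or down as needed and stops near the target instead of pushing the attribute as high as possible.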

Technical Innovations

  1. Goal-Oriented Design: Transitions from directional optimization to precise goal-reaching
  2. Real-Time Feedback Mechanism: Value function provides intermediate feedback during generation
  3. Representation Space Navigation: Direct precise navigation in high-dimensional representation space
  4. Multi-Attribute Coordination: Simultaneously controls multiple potentially conflicting attributes

Experimental Setup

Datasets

  1. HelpSteer2: 20,324 training samples, 1,038 test samples, containing 5 attributes (helpfulness, correctness, coherence, complexity, verbosity)
  2. Code-UltraFeedback: 10,000 complex instructions with 5 programming-related attributes (complexity & efficiency, style, explanation, instruction-following, readability)

Evaluation Metrics

  1. Self-BLEU Score: Measures diversity of generated text (lower Self-BLEU indicates more diverse outputs)
  2. ℓ1 Distance to Target: Evaluates proximity of model outputs to user-specified attribute scores
  3. Success Rate: Frequency with which model outputs precisely match expected attribute configurations
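
The last two metrics can be sketched as follows. The sketch assumes each output has already been scored as a vector of integer attribute ratings, and it treats "success" as an exact match with the target vector; both are assumptions for illustration, and the sample outputs are invented rather than taken from the paper.

```python
def l1_distance(scores, target):
    """Sum of absolute per-attribute gaps between scores and the target."""
    return sum(abs(s - t) for s, t in zip(scores, target))


def evaluate(outputs, target):
    """Return (mean L1 distance to target, fraction of exact matches)."""
    distances = [l1_distance(scores, target) for scores in outputs]
    mean_l1 = sum(distances) / len(distances)
    success_rate = sum(1 for d in distances if d == 0) / len(distances)
    return mean_l1, success_rate


# Target over five attributes, e.g. helpfulness, correctness, coherence,
# complexity, verbosity on HelpSteer2.
target = [4, 4, 4, 2, 2]

# Hypothetical attribute scores for four generated responses.
outputs = [
    [4, 4, 4, 2, 2],
    [4, 3, 4, 2, 3],
    [3, 4, 4, 1, 2],
    [4, 4, 4, 2, 2],
]

mean_l1, success_rate = evaluate(outputs, target)
```

An exact-match criterion over five attributes is strict, which is consistent with the single-digit success rates reported for HelpSteer2.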

Baseline Methods

  • Base: Direct generation from base model
  • Prompting: Includes target attribute scores in prompts
  • ITI: Trains linear layer to predict rewards and adjusts activations along learned directions
  • MAT-Steer: Learns sparse, orthogonal multi-attribute steering vectors
  • RE-Control: Performs test-time intervention with open-ended optimization

Implementation Details

  • Base models: LLaMA-3.2-3b and Phi-4-mini
  • Value function: 4-layer MLP
  • Reward model: ArmoRM-Llama3-8B
  • Intervention layer: Final transformer layer
  • Optimizer: Adam with early stopping

Experimental Results

Main Results

Experimental results on representative target scores show:

Positive Target (HelpSteer2 [4,4,4,2,2]):

  • LLaMA-3.2-3b: PRE-CONTROL success rate 7.96% vs best baseline 5.39%
  • Phi-4-mini: PRE-CONTROL success rate 8.31% vs best baseline 5.70%

Negative Target (HelpSteer2 [3,3,3,2,2]):

  • LLaMA-3.2-3b: PRE-CONTROL success rate 6.60% vs best baseline 5.84%
  • Phi-4-mini: PRE-CONTROL success rate 9.11% vs best baseline 8.73%

Code-UltraFeedback Results:

  • Positive target [3,3,3,3,3]: Success rate improves to 17.46%-26.16%
  • Negative target [2,2,2,2,2]: Success rate improves to 22.34%-30.68%

Iterative Intervention Results

PRE-CONTROL demonstrates continuous performance improvement across multiple iterations, while other methods plateau after the second iteration.

Pareto Frontier Approximation

  • Quality Improvement: Hypervolume increases from 7.54 to 12.66
  • Efficiency Improvement: Computational overhead drops from 3.3 GPU hours to 0.4 GPU hours (roughly an 8-fold reduction)
  • More Points Discovered: Non-dominated points increase from 45 to 69

Controllable Distillation

The approach achieves a hypervolume of 16.81 using 15k samples and 2.1 GPU hours, outperforming the Best-of-N baseline's 15.27, which requires 50k samples and 7.8 GPU hours.

Case Analysis

Qualitative analysis demonstrates PRE-CONTROL's ability to:

  • Negative Control: Precisely adjust overly detailed responses [4,4,4,3,3] to concise versions [3,3,3,2,2]
  • Positive Control: Expand simple responses [4,4,4,1,1] to more detailed versions [4,4,4,2,2]

Related Work

LLM Alignment

  1. Fine-tuning Paradigms: RLHF and DPO require multi-stage, resource-intensive training
  2. Inference-Time Interventions: Prompt engineering and guided decoding lack precise control mechanisms
  3. Multi-Objective Alignment: Existing methods require expensive retraining to inject multi-objective preferences

Representation Engineering

  1. Activation Perturbation: Evolved from plug-and-play methods to learned steering vectors
  2. Representation Fine-tuning: Uses low-rank projection matrices for efficient activation editing
  3. Limitations: Primarily focus on binary or categorical attribute control rather than precise targets on continuous scales

Conclusions and Discussion

Main Conclusions

  1. PRE-CONTROL achieves precise attribute intensity control in LLMs
  2. Goal-reaching formulation is more suitable for precise control than traditional maximization approaches
  3. The combination of value functions and gradient interventions provides an effective control mechanism
  4. The method demonstrates efficiency advantages in multiple downstream applications

Limitations

  1. Value Function as Reward Model Proxy: Lightweight MLPs may fail to capture all details of the original reward signal
  2. Final-Layer Intervention: Current implementation applies interventions only at the final transformer layer, potentially underutilizing the model's representational hierarchy
  3. Computational Overhead: While relatively efficient, still requires additional value function training and inference-time computation

Future Directions

  1. Explore more sophisticated value function architectures to better approximate reward model capabilities
  2. Investigate multi-layer intervention strategies or attention-level modifications
  3. Develop adaptive mechanisms to selectively query full reward models for difficult cases

In-Depth Evaluation

Strengths

  1. Strong Innovation: Reformulates attribute control as goal-reaching, breaking through limitations of traditional directional alignment
  2. Systematic Methodology: Value function training, TD learning, and gradient interventions form a complete technical framework
  3. Comprehensive Experiments: Thorough evaluation across two datasets and two models, including ablation studies and application validation
  4. High Practical Value: Demonstrates significant efficiency improvements in Pareto frontier approximation and model distillation

Weaknesses

  1. Insufficient Theoretical Analysis: Lacks convergence guarantees and theoretical analysis of intervention stability
  2. Value Function Dependency: Method performance heavily depends on value function quality
  3. Generalization Capability: Validation limited to specific attributes and models; broader generalization remains to be verified
  4. Computational Complexity: While relatively efficient, inference still requires additional computation

Impact

  1. Academic Contribution: Provides new research paradigm for precise LLM control
  2. Practical Value: Offers effective tools for personalized AI systems and multi-objective optimization
  3. Reproducibility: Authors provide complete code and experimental configurations

Applicable Scenarios

  1. Personalized Content Generation: Requires precise control of text style, complexity, and other attributes
  2. Multi-Objective Optimization: Finding optimal balance points among conflicting attributes
  3. Model Alignment: Efficiently generating training data satisfying specific attribute requirements
  4. Interactive AI Systems: Dynamically adjusting output attributes based on user feedback

References

The paper cites 46 relevant references covering key areas including LLM alignment, multi-objective optimization, and representation engineering, providing a solid theoretical foundation for the research.


Overall Assessment: This is a high-quality research paper proposing an innovative method for precise attribute intensity control, demonstrating excellence in both theoretical contribution and practical value. The methodology is well-designed, experiments are thoroughly validated, and the work makes significant contributions to the LLM control field.