Precise Attribute Intensity Control in Large Language Models via Targeted Representation Editing

Zhang, Ye, Heng et al.

Precise attribute intensity control--generating Large Language Model (LLM) outputs with specific, user-defined attribute intensities--is crucial for AI systems adaptable to diverse user expectations. Current LLM alignment methods, however, typically provide only directional or open-ended guidance, failing to reliably achieve exact attribute intensities. We address this limitation with three key designs: (1) reformulating precise attribute intensity control as a target-reaching problem, rather than simple maximization; (2) training a lightweight value function via temporal-difference learning to predict final attribute intensity scores from partial generations, thereby steering LLM outputs; and (3) employing gradient-based interventions on hidden representations to navigate the model precisely towards specific attribute intensity targets. Our method enables fine-grained, continuous control over attribute intensities, moving beyond simple directional alignment. Experiments on LLaMA-3.2-3b and Phi-4-mini confirm our method's ability to steer text generation to user-specified attribute intensities with high accuracy. Finally, we demonstrate efficiency enhancements across three downstream tasks: preference data synthesis, Pareto frontier approximation and optimization, and distillation of aligned behaviors for intervention-free inference. Our code is available on https://github.com/Pre-Control/pre-control

Basic Information

Paper ID: 2510.12121
Title: Precise Attribute Intensity Control in Large Language Models via Targeted Representation Editing
Authors: Rongzhi Zhang, Liqin Ye, Yuzhao Heng, Xiang Chen, Tong Yu, Lingkai Kong, Sudheer Chava, Chao Zhang
Classification: cs.AI cs.CL cs.LG
Publication Date/Venue: Preprint (Under review)
Paper Link: https://arxiv.org/abs/2510.12121

Abstract

This paper proposes PRE-CONTROL, a method for precise control of attribute intensity in large language models (LLMs). The method achieves precise attribute intensity control through three key designs: (1) reformulating precise attribute intensity control as a goal-reaching problem rather than simple maximization; (2) training a lightweight value function via temporal difference learning to predict final attribute intensity scores from partial generations; (3) employing gradient-based interventions on hidden representations to precisely navigate the model toward specific attribute intensity targets. Experiments demonstrate that the method can guide text generation to user-specified attribute intensities and exhibits efficiency improvements in downstream tasks including preference data synthesis, Pareto frontier approximation, and aligned behavior distillation.

Research Background and Motivation

Problem Definition

Current LLM alignment methods suffer from a critical limitation: they can only provide directional or open-ended guidance without reliably achieving precise attribute intensities. For example, a user might desire an email's formality level to be 3 on a 5-point scale, rather than simply "more formal" or "less formal."

Problem Importance

Precise attribute intensity control is essential for building AI systems that adapt to diverse user expectations. It matters most in multi-objective alignment, where attributes conflict and finding a good trade-off requires scalar-level adjustments on continuous scales.

Limitations of Existing Methods

RLHF and DPO: Produce static models capturing average desired behavior, requiring expensive retraining to adjust priorities
Prompting methods: Entirely dependent on model interpretation of style instructions, yielding inconsistent results
Guided decoding: Typically treats attribute intensity as categorical rather than continuous
Multi-objective alignment methods: Require substantial training to approximate global Pareto sets

Research Motivation

Existing methods lack the capability for precise attribute intensity control. This work aims to achieve fine-grained, continuous attribute intensity control, transcending simple directional alignment.

Core Contributions

Problem Reformulation: Reformulates precise attribute intensity control as a goal-reaching problem rather than simple maximization/minimization
Value Function Approach: Trains a lightweight value function via temporal difference learning to predict final attribute scores from partial generations
Representation Editing Technique: Employs gradient-based hidden representation interventions to precisely navigate to specific attribute intensity targets
Efficient Applications: Demonstrates efficiency advantages in Pareto frontier approximation (reducing time complexity from O(m^d) to O(n+k)) and controllable model distillation

Methodology Details

Task Definition

Given a target attribute intensity τ ∈ [0,1] and a reward function R(x), the objective is to generate text whose attribute intensity score matches the target value, rather than simply maximizing reward.

Model Architecture

1. Goal-Reaching Problem Reformulation

Traditional alignment objective:

max_θ E_{x~π_θ}[R(x)]

Goal-reaching formulation in this work:

min_θ E_{x~π_θ}[(R̂(x) - τ)²]

where R̂(x) is the reward function normalized to [0,1].
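
The contrast between the two objectives can be illustrated with a minimal sketch. It assumes raw scores on a 0 to 4 scale and min-max normalization; both are illustrative assumptions, since this summary does not state how R̂ is computed, and the function names `normalize` and `target_loss` are hypothetical.

```python
def normalize(score, low=0.0, high=4.0):
    """Min-max normalize a raw attribute score to [0, 1] (assumed scheme)."""
    if high <= low:
        raise ValueError("high must exceed low")
    return (score - low) / (high - low)


def target_loss(score, target, low=0.0, high=4.0):
    """Squared distance between the normalized score and the normalized target."""
    return (normalize(score, low, high) - normalize(target, low, high)) ** 2


# Three candidate responses with raw attribute scores 1, 2 and 4.
candidates = [1.0, 2.0, 4.0]

# Maximization prefers the highest score regardless of what the user asked for.
best_for_max = max(candidates)

# Target-reaching prefers the candidate closest to the requested intensity (2).
best_for_target = min(candidates, key=lambda s: target_loss(s, target=2.0))
```

Under maximization the response scored 4 wins, while the target-reaching loss selects the response scored 2, which is the behavior the goal-reaching formulation is meant to encourage.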

2. Value Function Training

A value function V_φ(h_t) is trained with TD(λ) to predict the expected final attribute intensity given a partial sequence:

V_φ(h_t) ≈ E_{x_{>t}~π_θ(·|x_{≤t})}[R̂(x_{≤t}, x_{>t})]

Generalized return (λ-return) used as the regression target, where r_T is the reward assigned to the completed sequence:

G^λ_t = (1-λ)∑_{n=1}^{T-t-1} λ^{n-1}V_φ(s_{t+n}) + λ^{T-t-1}r_T

Value function loss:

L_TD = E_{t,s_t}[(V_φ(s_t) - G^λ_t)²]
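
The λ-return and the loss above can be sketched in plain Python. This is an illustrative sketch, not the authors' implementation: it assumes a single trajectory, no discounting, zero intermediate rewards, and a list `values` holding V_φ(s_0) through V_φ(s_{T-1}). The function names are hypothetical.

```python
def lambda_returns(values, terminal_reward, lam):
    """Compute G^lambda_t for t = 0..T-1, where values[t] = V(s_t)."""
    T = len(values)
    returns = []
    for t in range(T):
        horizon = T - t - 1  # number of future states available for bootstrapping
        bootstrapped = sum(
            lam ** (n - 1) * values[t + n] for n in range(1, horizon + 1)
        )
        # The weights (1 - lam) * lam^(n-1) and lam^horizon sum to one.
        returns.append((1.0 - lam) * bootstrapped + lam ** horizon * terminal_reward)
    return returns


def td_loss(values, terminal_reward, lam):
    """Mean squared error between value predictions and their lambda-returns."""
    targets = lambda_returns(values, terminal_reward, lam)
    return sum((v - g) ** 2 for v, g in zip(values, targets)) / len(values)


# Predictions for three partial generations and a final normalized reward of 1.0.
one_step_targets = lambda_returns([0.2, 0.4, 0.6], terminal_reward=1.0, lam=0.0)
monte_carlo_targets = lambda_returns([0.2, 0.4, 0.6], terminal_reward=1.0, lam=1.0)
```

With λ = 0 each target is the next state's value (one-step TD), and with λ = 1 every target equals the final reward (Monte Carlo); intermediate values of λ interpolate between the two.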

3. Test-Time Intervention

Adjusts hidden states via gradient descent:

h_t ← h_t - α∇_{h_t}(V_φ(h_t) - τ)²

Multi-attribute case:

h_t ← h_t - α∇_{h_t}∑_{i=1}^m w_i(V^i_φ(h_t) - τ_i)²
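
The update rules above can be illustrated with a toy example. This sketch makes several assumptions that are not from the paper: each value function is replaced by a hypothetical sigmoid-linear head (`ToyValueHead`) rather than a trained MLP, gradients are derived analytically instead of by automatic differentiation, and the hidden state is a two-dimensional list. Only the update itself follows the formulas above.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class ToyValueHead:
    """Hypothetical stand-in for a trained value function: V(h) = sigmoid(w.h + b)."""

    def __init__(self, weights, bias=0.0):
        self.weights = list(weights)
        self.bias = bias

    def value(self, h):
        return sigmoid(sum(w * x for w, x in zip(self.weights, h)) + self.bias)

    def grad(self, h):
        # dV/dh for the sigmoid-linear head, derived analytically.
        v = self.value(h)
        return [v * (1.0 - v) * w for w in self.weights]


def intervene(h, heads, targets, attr_weights, alpha=0.5, max_steps=1000, tol=1e-3):
    """Gradient descent on sum_i w_i * (V_i(h) - tau_i)^2 with early stopping."""
    h = list(h)
    for _ in range(max_steps):
        values = [head.value(h) for head in heads]
        if all(abs(v - t) < tol for v, t in zip(values, targets)):
            break  # every attribute is within tolerance of its target
        step = [0.0] * len(h)
        for head, v, t, w in zip(heads, values, targets, attr_weights):
            for j, g in enumerate(head.grad(h)):
                step[j] += w * 2.0 * (v - t) * g  # chain rule on the squared error
        h = [x - alpha * s for x, s in zip(h, step)]
    return h


# Two attributes with targets 0.7 and 0.3, both starting from V = 0.5.
heads = [ToyValueHead([1.0, 0.0]), ToyValueHead([0.0, 1.0])]
edited = intervene([0.0, 0.0], heads, targets=[0.7, 0.3], attr_weights=[1.0, 1.0])
```

Because the loss is a squared distance to τ rather than the value itself, the update pushes the prediction up when it is below the target and down when it is above, which is what distinguishes target-reaching from open-ended steering. The single-attribute rule is the special case with one head and weight 1.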

Technical Innovations

Goal-Oriented Design: Transitions from directional optimization to precise goal-reaching
Real-Time Feedback Mechanism: Value function provides intermediate feedback during generation
Representation Space Navigation: Direct precise navigation in high-dimensional representation space
Multi-Attribute Coordination: Simultaneously controls multiple potentially conflicting attributes

Experimental Setup

Datasets

HelpSteer2: 20,324 training samples, 1,038 test samples, containing 5 attributes (helpfulness, correctness, coherence, complexity, verbosity)
Code-UltraFeedback: 10,000 complex instructions with 5 programming-related attributes (complexity & efficiency, style, explanation, instruction-following, readability)

Evaluation Metrics

Self-BLEU Score: Measures diversity of generated text by scoring each output against the others with BLEU (lower means more diverse)
ℓ1 Distance to Target: Evaluates proximity of model outputs to user-specified attribute scores
Success Rate: Frequency with which model outputs precisely match expected attribute configurations
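
The last two metrics can be sketched as follows. The sketch assumes that attribute scores have already been discretized to integers and that "success" means an exact match on every attribute; the paper's exact matching criterion is not given in this summary, so treat both as assumptions. Function names are hypothetical.

```python
def l1_distance(scores, target):
    """Sum of absolute per-attribute differences to the target configuration."""
    return sum(abs(s - t) for s, t in zip(scores, target))


def mean_l1_distance(batch, target):
    """Average l1 distance over a batch of scored outputs."""
    return sum(l1_distance(scores, target) for scores in batch) / len(batch)


def success_rate(batch, target):
    """Fraction of outputs matching the target on every attribute (assumed criterion)."""
    hits = sum(1 for scores in batch if tuple(scores) == tuple(target))
    return hits / len(batch)


# Three scored outputs evaluated against the HelpSteer2 target [4,4,4,2,2].
target = (4, 4, 4, 2, 2)
batch = [(4, 4, 4, 2, 2), (4, 4, 3, 2, 2), (3, 3, 3, 2, 2)]
avg_distance = mean_l1_distance(batch, target)
rate = success_rate(batch, target)
```

An exact match across five attributes is a strict criterion, which helps explain why the reported success rates are in the single digits on HelpSteer2 even for the best method.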

Baseline Methods

Base: Direct generation from base model
Prompting: Includes target attribute scores in prompts
ITI: Trains linear layer to predict rewards and adjusts activations along learned directions
MAT-Steer: Learns sparse, orthogonal multi-attribute steering vectors
RE-Control: Performs test-time intervention with open-ended optimization

Implementation Details

Base models: LLaMA-3.2-3b and Phi-4-mini
Value function: 4-layer MLP
Reward model: ArmoRM-Llama3-8B
Intervention layer: Final transformer layer
Optimizer: Adam with early stopping

Experimental Results

Main Results

Experimental results on representative target scores show:

Positive Target (HelpSteer2 [4,4,4,2,2]):

LLaMA-3.2-3b: PRE-CONTROL success rate 7.96% vs best baseline 5.39%
Phi-4-mini: PRE-CONTROL success rate 8.31% vs best baseline 5.70%

Negative Target (HelpSteer2 [3,3,3,2,2]):

LLaMA-3.2-3b: PRE-CONTROL success rate 6.60% vs best baseline 5.84%
Phi-4-mini: PRE-CONTROL success rate 9.11% vs best baseline 8.73%

Code-UltraFeedback Results:

Positive target [3,3,3,3,3]: Success rate improves to 17.46%-26.16%
Negative target [2,2,2,2,2]: Success rate improves to 22.34%-30.68%

Iterative Intervention Results

PRE-CONTROL demonstrates continuous performance improvement across multiple iterations, while other methods plateau after the second iteration.

Pareto Frontier Approximation

Quality Improvement: Hypervolume increases from 7.54 to 12.66
Efficiency Improvement: Computational overhead drops from 3.3 GPU hours to 0.4 GPU hours (roughly an 8-fold reduction)
More Points Discovered: Non-dominated points increase from 45 to 69
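
For readers unfamiliar with the two quantities reported here, the following sketch shows how non-dominated points and hypervolume are commonly computed. It is restricted to two objectives, assumes both are maximized, and assumes every point dominates the reference point; the paper's own evaluation code and reference point are not described in this summary.

```python
def dominates(a, b):
    """a dominates b if a is at least as good everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def hypervolume_2d(points, reference=(0.0, 0.0)):
    """Area dominated by the front and bounded below by the reference point."""
    front = sorted(pareto_front(points), key=lambda p: p[0], reverse=True)
    area, best_y = 0.0, reference[1]
    for x, y in front:
        if y > best_y:
            # Add the horizontal strip this point contributes beyond earlier ones.
            area += (x - reference[0]) * (y - best_y)
            best_y = y
    return area


# Five candidate trade-offs between two attributes.
points = [(1, 3), (2, 2), (3, 1), (1, 1), (2, 1)]
front = pareto_front(points)
volume = hypervolume_2d(points)
```

A larger hypervolume means the discovered front covers more of the objective space, and more non-dominated points means the front is sampled more densely, which is how the two numbers above should be read together.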

Controllable Distillation

The distilled model achieves a hypervolume of 16.81 using 15k samples and 2.1 GPU hours, outperforming the Best-of-N method's 15.27, which requires 50k samples and 7.8 GPU hours.

Case Analysis

Qualitative analysis demonstrates PRE-CONTROL's ability to:

Negative Control: Precisely adjust overly detailed responses [4,4,4,3,3] to concise versions [3,3,3,2,2]
Positive Control: Expand simple responses [4,4,4,1,1] to more detailed versions [4,4,4,2,2]

Related Work

LLM Alignment

Fine-tuning Paradigms: RLHF and DPO require multi-stage, resource-intensive training
Inference-Time Interventions: Prompt engineering and guided decoding lack precise control mechanisms
Multi-Objective Alignment: Existing methods require expensive retraining to inject multi-objective preferences

Representation Engineering

Activation Perturbation: Evolved from plug-and-play methods to learned steering vectors
Representation Fine-tuning: Uses low-rank projection matrices for efficient activation editing
Limitations: Existing approaches primarily focus on binary or categorical attribute control rather than precise targets on continuous scales

Conclusions and Discussion

Main Conclusions

PRE-CONTROL achieves precise attribute intensity control in LLMs
Goal-reaching formulation is more suitable for precise control than traditional maximization approaches
The combination of value functions and gradient interventions provides an effective control mechanism
The method demonstrates efficiency advantages in multiple downstream applications

Limitations

Value Function as Reward Model Proxy: Lightweight MLPs may fail to capture all details of the original reward signal
Final-Layer Intervention: Current implementation applies interventions only at the final transformer layer, potentially underutilizing the model's representational hierarchy
Computational Overhead: While relatively efficient, still requires additional value function training and inference-time computation

Future Directions

Explore more sophisticated value function architectures to better approximate reward model capabilities
Investigate multi-layer intervention strategies or attention-level modifications
Develop adaptive mechanisms to selectively query full reward models for difficult cases

In-Depth Evaluation

Strengths

Strong Innovation: Reformulates attribute control as goal-reaching, breaking through limitations of traditional directional alignment
Systematic Methodology: Value function training, TD learning, and gradient interventions form a complete technical framework
Comprehensive Experiments: Thorough evaluation across two datasets and two models, including ablation studies and application validation
High Practical Value: Demonstrates significant efficiency improvements in Pareto frontier approximation and model distillation

Weaknesses

Insufficient Theoretical Analysis: Lacks convergence guarantees and theoretical analysis of intervention stability
Value Function Dependency: Method performance heavily depends on value function quality
Generalization Capability: Validation limited to specific attributes and models; broader generalization remains to be verified
Computational Complexity: While relatively efficient, inference still requires additional computation

Impact

Academic Contribution: Provides new research paradigm for precise LLM control
Practical Value: Offers effective tools for personalized AI systems and multi-objective optimization
Reproducibility: Authors provide complete code and experimental configurations

Applicable Scenarios

Personalized Content Generation: Requires precise control of text style, complexity, and other attributes
Multi-Objective Optimization: Finding optimal balance points among conflicting attributes
Model Alignment: Efficiently generating training data satisfying specific attribute requirements
Interactive AI Systems: Dynamically adjusting output attributes based on user feedback

References

The paper cites 46 relevant references covering key areas including LLM alignment, multi-objective optimization, and representation engineering, providing a solid theoretical foundation for the research.

Overall Assessment: The paper proposes a well-designed method for precise attribute intensity control and validates it across two datasets, two models, and several downstream applications. Its practical value is clear; the main gaps are the missing convergence analysis and the dependence on value function quality noted above.