
Human-in-the-loop: Real-time Preference Optimization

Wang, Xu, Jones

Optimization with preference feedback is an active research area with many applications in engineering systems where humans play a central role, such as building control and autonomous vehicles. While most existing studies focus on optimizing a static user utility, few have investigated its closed-loop behavior that accounts for system transients. In this work, we propose an online feedback optimization controller that can optimize user utility using pairwise comparison feedback with both optimality and closed-loop stability guarantees. By adding a random exploration signal, the controller estimates the gradient based on the binary utility comparison feedback between two consecutive time steps. We analyze its closed-loop behavior when interacting with a nonlinear plant and show that, under mild assumptions, the controller converges to the optimal point without inducing instability. Theoretical findings are further validated through numerical experiments.


Basic Information

Paper ID: 2506.02225
Title: Human-in-the-loop: Real-time Preference Optimization
Authors: Wenbin Wang, Wenjie Xu, Colin N. Jones (Automatic Control Laboratory, EPFL)
Classification: math.OC (Optimization and Control)
Publication Date: arXiv preprint, November 3, 2025 (v2)
Paper Link: https://arxiv.org/abs/2506.02225

Abstract

This paper investigates optimization problems with preference feedback, which have broad applications in human-centric engineering systems such as building control and autonomous driving. Existing research primarily focuses on static user utility optimization with limited consideration of closed-loop transient behavior. The paper proposes an online feedback optimization controller that leverages pairwise comparison feedback to optimize user utility while providing optimality and closed-loop stability guarantees. By incorporating random exploration signals, the controller estimates gradients based on binary utility comparisons between consecutive time steps. The authors analyze the closed-loop behavior when the controller interacts with nonlinear systems and prove that under mild assumptions, the controller converges to the optimal point without inducing instability. Theoretical findings are validated through numerical experiments.

Research Background and Motivation

Problems to be Addressed

Human-machine interactive control: How to design human-aware controllers that optimize user utility in real-time, enabling systems to adapt based on user preferences
Real-time optimization with preference feedback: How to perform online optimization using binary preference comparisons rather than absolute utility values
Closed-loop stability guarantees: How to ensure the optimization process does not destabilize the system while considering transient behavior

Problem Significance

Individual differences: Traditional controllers track predefined reference points based on large-scale population models (e.g., indoor temperature in building control), introducing bias and suboptimal performance due to inability to account for individual variations
Time-varying utility: Without real-time human feedback, controllers cannot respond to time-varying utility and lack robustness to external disturbances
Human cognitive characteristics: Humans are better at relative comparisons than at absolute assessments, so preference feedback naturally takes the form of pairwise comparisons

Limitations of Existing Methods

Online feedback optimization (OFO): Existing OFO methods (e.g., grid control, robot coordination) require precise utility values or gradient information, making direct application to human preference feedback challenging
Offline preference optimization:
- Most research considers static problems, ignoring system transient behavior
- Existing gradient estimation methods (e.g., [18, 19]) require two function evaluations per time step, which makes them unsuitable for online implementation
- Lack of closed-loop stability analysis
Stability quantification difficulty: The binary nature of preference feedback makes overall dynamics highly nonlinear, complicating stability analysis
Limited user knowledge: Users typically have limited understanding of system dynamics; directly following their preferences may cause instability

Research Motivation

Inspired by the recently proposed model-free OFO with single-point residual estimation [8], the authors aim to provide the first treatment of real-time preference optimization with closed-loop guarantees.

Core Contributions

Novel OFO controller: Proposes the first online feedback optimization controller utilizing binary preference feedback to optimize user utility while ensuring closed-loop stability
Single-point evaluation scheme: Employs a random exploration scheme requiring only one utility evaluation per time step (rather than two), better suited for online implementation
Theoretical guarantees:
- Proves closed-loop system stability (Lemma 1: bounded expected Lyapunov function)
- Establishes optimality guarantees (Theorem 1: the expected squared distance to the optimum converges to a neighborhood of size O(μ, δ))
- Quantifies transient effects on performance
First closed-loop guarantee: To the authors' knowledge, this is the first work providing closed-loop guarantees for real-time preference optimization
Numerical validation: Validates theoretical results through thermal comfort optimization problems

Methodology Details

Task Definition

System model: Consider an exponentially stable system $x_{k+1} = f(x_k, u_k)$, where $x \in \mathbb{R}^{n_x}$ is the system state, $u \in \mathbb{R}^{n_u}$ is the control input, and there exists a unique steady-state input-state mapping $h: \mathbb{R}^{n_u} \rightarrow \mathbb{R}^{n_x}$.

Optimization objective: Optimize user utility at steady state, $\min_{x,u} \Phi(x, u) \quad \text{s.t. } x = h(u)$, which is equivalent to the unconstrained problem $\min_u \tilde{\Phi}(u), \quad \text{where } \tilde{\Phi}(u) = \Phi(h(u), u)$.

Preference feedback model (Bradley-Terry model): $P(\mathbb{1}_{u_1 \succ u_2} = 1) = \sigma(\tilde{\Phi}(u_2) - \tilde{\Phi}(u_1))$ where $\sigma(t) = \frac{1}{1+e^{-t}}$ is the sigmoid function.
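The feedback model above can be simulated in a few lines. The following is a minimal sketch, assuming that $\tilde{\Phi}$ is a cost to be minimized (so the option with the lower value tends to be preferred) and that a simulated user has access to both cost values. The function names are illustrative and do not come from the paper.

```python
import math
import random


def sigmoid(t: float) -> float:
    """Numerically stable logistic function sigma(t) = 1 / (1 + exp(-t))."""
    if t >= 0.0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def preference_probability(cost_1: float, cost_2: float) -> float:
    """P(option 1 is preferred to option 2) = sigma(cost_2 - cost_1)."""
    return sigmoid(cost_2 - cost_1)


def sample_preference(cost_1: float, cost_2: float, rng: random.Random) -> int:
    """Draw one binary comparison: 1 if option 1 is preferred, else 0."""
    return 1 if rng.random() < preference_probability(cost_1, cost_2) else 0


if __name__ == "__main__":
    rng = random.Random(0)
    # Option 1 has the lower cost, so it should win most comparisons.
    n = 10_000
    wins = sum(sample_preference(1.0, 2.0, rng) for _ in range(n))
    print(f"empirical win rate: {wins / n:.3f}")
    print(f"model probability:  {preference_probability(1.0, 2.0):.3f}")
```

Equal costs give a probability of exactly 0.5, and a larger cost gap pushes the comparison toward a deterministic answer, which is the regime the noiseless user model in the experiments corresponds to.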

Key assumptions:

Input-state mapping $h$ is Lipschitz continuous
Utility function $\Phi(x,u)$ is Lipschitz continuous in $x$
$\tilde{\Phi}(u)$ is differentiable, Lipschitz continuous, smooth, and strongly convex

Model Architecture

Algorithm flow (Algorithm 1):

Input: step size η, smoothing parameter δ, initial input u₀, time steps T
for k = 1, ..., T-1:
    1. Add random exploration: xₖ₊₁ = f(xₖ, uₖ + δvₖ)
       where vₖ is uniformly sampled from (nᵤ-1)-dimensional unit sphere
    
    2. Collect preference feedback:
       Ask user to compare Φ(xₖ₊₁, uₖ + δvₖ) and Φ(xₖ, uₖ₋₁ + δvₖ₋₁)
       Sample 𝟙_{(xₖ₊₁,uₖ+δvₖ)≻(xₖ,uₖ₋₁+δvₖ₋₁)}
    
    3. Update control input:
       uₖ₊₁ = uₖ + (η/2δ)𝟙_{(xₖ₊₁,uₖ+δvₖ)≻(xₖ,uₖ₋₁+δvₖ₋₁)}vₖ
end for
Output: u_T
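The loop above can be sketched in plain Python as follows. This is an illustrative reading of the steps, not the authors' implementation. It assumes that the feedback is encoded as +1 when the newer pair is preferred and -1 otherwise (the sign-based noiseless model in this summary suggests this encoding, but it is an assumption), that an initial perturbed step is applied to obtain a comparison baseline, and that `plant`, `cost`, and `feedback` are hypothetical callables supplied by the user. In a simulation the cost is evaluated only to emulate the user; the controller itself sees just the binary answer.

```python
import math
import random


def sample_unit_sphere(dim, rng):
    """Uniform direction on the unit sphere via a normalized Gaussian vector."""
    while True:
        g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in g))
        if norm > 0.0:
            return [c / norm for c in g]


def run_controller(plant, cost, feedback, x0, u0, eta, delta, steps, rng):
    """Preference-based feedback optimization loop (sketch).

    plant(x, a)  -> next state when input a is applied in state x
    cost(x, a)   -> scalar utility to minimize (used only to emulate the user)
    feedback(new_cost, old_cost, rng) -> +1 if the new pair is preferred, else -1
    """
    x, u = list(x0), list(u0)
    # Assumed initial step: apply one perturbed input to get a baseline pair.
    v = sample_unit_sphere(len(u), rng)
    applied = [ui + delta * vi for ui, vi in zip(u, v)]
    x = plant(x, applied)
    prev_cost = cost(x, applied)
    gain = eta / (2.0 * delta)
    for _ in range(steps):
        # 1. Add random exploration and let the plant evolve one step.
        v = sample_unit_sphere(len(u), rng)
        applied = [ui + delta * vi for ui, vi in zip(u, v)]
        x = plant(x, applied)
        new_cost = cost(x, applied)
        # 2. One comparison between the current and the previous pair.
        s = feedback(new_cost, prev_cost, rng)
        # 3. Move along v if the new pair was preferred, against it otherwise.
        u = [ui + gain * s * vi for ui, vi in zip(u, v)]
        prev_cost = new_cost
    return x, u


if __name__ == "__main__":
    # Hypothetical scalar example: x+ = 0.1 x + a, cost (x - 10)^2, so u* = 9.
    plant = lambda x, a: [0.1 * x[0] + a[0]]
    cost = lambda x, a: (x[0] - 10.0) ** 2
    noiseless = lambda new, old, rng: 1 if new < old else -1
    x, u = run_controller(plant, cost, noiseless, [0.0], [0.0],
                          eta=0.1, delta=0.5, steps=3000, rng=random.Random(0))
    print("final input:", u, "final state:", x)
```

Because every update has the fixed length η/(2δ), the input does not settle exactly at the optimum but keeps moving in a neighborhood of it, which is consistent with the steady-state error discussed in the theoretical results.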

Closed-loop system: $x_{k+1} = f(x_k, u_k + \delta v_k)$, $\quad u_{k+1} = u_k + \frac{\eta}{2\delta}\mathbb{1}_{(x_{k+1},u_k+\delta v_k)\succ(x_k,u_{k-1}+\delta v_{k-1})}v_k$

Technical Innovations

Single-point residual estimation:
- Leverages $x_{k+1}$ to approximate $h(u_k + \delta v_k)$ , avoiding need for precise system model
- Requires only one utility evaluation per time step, versus two in traditional methods
- Based on comparisons between consecutive time steps, naturally incorporating temporal structure
Probabilistic gradient descent interpretation:
- Interprets update rule as gradient descent on probability function $p_{u'}(u) = P(\mathbb{1}_{u \succ u'} = 1)$
- Proves minimizing $p_{u'}(u)$ is equivalent to minimizing $\tilde{\Phi}(u)$ (Lemma 3)
- Rewrites update as: $u_{k+1} = u_k - \eta(\nabla p_{u_k}(u_k) + e_k)$
- where error term $e_k$ arises from approximating $h(u_k + \delta v_k)$ with $x_{k+1}$ and stochastic gradient estimation
Error analysis framework:
- Explicitly bounds the error term $e_k$ (Lemma 4): $\|E[e_k|F_k]\| \leq \sqrt{R_1 V(x_{k-1}, u_{k-1}+\delta v_{k-1}) + R_2}$
- where $R_1 = O(\mu)$, $R_2 = O(\mu, \delta^2)$, and $\mu$ is the system decay rate
- Faster system stabilization (smaller μ) yields smaller approximation error
Unified stability and optimality analysis:
- Analyzes stability via Lyapunov function (Lemma 1)
- Analyzes optimality via expected distance $E[\|u_k - u^*\|^2]$ (Theorem 1)
- Connects both through system transient behavior

Theoretical Results

Stability (Lemma 1): $E[V(x_k, u_k+\delta v_k)] \leq \mu^k E[V(x_0, u_0+\delta v_0)] + \frac{a_1}{1-\mu}(2\delta^2 + \eta + (\frac{\eta}{2\delta})^2)$, where $\mu = \frac{2\alpha_2}{\alpha_1}(1-\frac{\alpha_3}{\alpha_2}) < 1$.

Optimality (Theorem 1): $E[\|u_k - u^*\|^2] \leq (\frac{1+\rho}{2})^{k-k'} E[\|u_{k'} - u^*\|^2] + O(\mu, \mu^{k'}, \delta)$, where $\rho = 1 - 2\sigma'(0)m\eta$.

Key insights:

Steady-state error characterized by $O(\mu, \delta)$
Faster system stabilization (smaller μ) leads to better performance
Exploration-exploitation tradeoff exists (choice of δ)
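To get a feel for the constants in these bounds, the short sketch below evaluates the parameter-dependent pieces for given η and δ. It relies on two things that are not stated in this summary: $\sigma'(0) = 1/4$ for the logistic function (a standard fact), and the reading of $m$ as the strong-convexity constant of $\tilde{\Phi}$ (an assumption, since $m$ is not defined here). The constants $a_1$ and $\mu$ are left out because their values are not given.

```python
import math

SIGMOID_SLOPE_AT_ZERO = 0.25  # sigma'(0) = sigma(0) * (1 - sigma(0)) = 1/4


def rho(m: float, eta: float) -> float:
    """rho = 1 - 2 * sigma'(0) * m * eta from the Theorem 1 statement."""
    return 1.0 - 2.0 * SIGMOID_SLOPE_AT_ZERO * m * eta


def contraction_factor(m: float, eta: float) -> float:
    """Per-step factor (1 + rho) / 2 multiplying the expected squared distance."""
    return (1.0 + rho(m, eta)) / 2.0


def exploration_term(eta: float, delta: float) -> float:
    """Bracketed term 2*delta^2 + eta + (eta / (2*delta))^2 in the Lemma 1 bound."""
    return 2.0 * delta ** 2 + eta + (eta / (2.0 * delta)) ** 2


def steps_to_shrink(factor: float, ratio: float) -> int:
    """Smallest k with factor**k <= ratio, for 0 < factor < 1 and 0 < ratio < 1."""
    return math.ceil(math.log(ratio) / math.log(factor))


if __name__ == "__main__":
    eta, delta = 0.1, 0.5   # values reported under Implementation Details
    m = 1.0                 # hypothetical strong-convexity constant
    print("rho:", rho(m, eta))
    print("contraction factor:", contraction_factor(m, eta))
    print("exploration term:", exploration_term(eta, delta))
    print("steps for a 100x reduction:", steps_to_shrink(contraction_factor(m, eta), 0.01))
```

The exploration term makes the tradeoff in δ visible: the $2\delta^2$ part grows with δ while the $(\eta/2\delta)^2$ part grows as δ shrinks.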

Experimental Setup

Datasets/System Models

Experiment 1: Quadratic problem

System: LTI system $x_{k+1} = Ax_k + Bu_k$
Matrices: $A = \begin{bmatrix} c & 1 \\ 0 & c \end{bmatrix}$, $B = I$ (identity matrix)
Parameter variation: $c \in \{0.1, 0.7\}$ to test different decay rates
Optimization objective: $\min (x-x_{ref})^\top(x-x_{ref})$, where $x_{ref} = [100, 100]^\top$
Steady-state mapping: $H = (I-A)^{-1}B$

Experiment 2: Thermal comfort optimization

System: 13-state building LTI model [27]
Utility function: PMV (Predicted Mean Vote) model [3]
Evaluation metric: PPD (Predicted Percentage of Dissatisfied) index
Objective: Identify indoor temperature minimizing PPD
User settings: Typing activity, wearing athletic pants, T-shirt, and shoes
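The link between PMV and PPD is not spelled out in this summary. A minimal sketch follows, assuming the standard relation from ISO 7730 (Fanger's model), $PPD = 100 - 95\exp(-0.03353\,PMV^4 - 0.2179\,PMV^2)$. The computation of PMV itself from temperature, clothing, and activity level is not reproduced here.

```python
import math


def ppd_from_pmv(pmv: float) -> float:
    """Predicted Percentage of Dissatisfied (in %) for a given PMV value.

    Uses the standard ISO 7730 relation; this formula is an assumption about
    what the experiment evaluates, since the summary does not state it.
    """
    return 100.0 - 95.0 * math.exp(-0.03353 * pmv ** 4 - 0.2179 * pmv ** 2)


if __name__ == "__main__":
    for pmv in (-2.0, -1.0, 0.0, 1.0, 2.0):
        print(f"PMV = {pmv:+.1f} -> PPD = {ppd_from_pmv(pmv):.1f} %")
```

PPD is symmetric in PMV and attains its minimum of 5 % at PMV = 0, so minimizing PPD amounts to finding the temperature at which the predicted vote is neutral.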

Evaluation Metrics

Relative error: $\|x_k - x_{ref}\|/\|x_{ref}\|$ (log scale)
Temperature tracking: Difference between actual and optimal temperature
Steady-state variance: Algorithm fluctuation at steady-state
Overshoot: Maximum deviation during convergence
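The metrics above are described only informally, so the sketch below fixes one possible definition for each. The overshoot definition (how far a scalar trajectory passes the target on the side opposite to its starting point) and the use of a trailing window for the steady-state variance are assumptions made for illustration.

```python
import math
import statistics


def relative_error(x, x_ref) -> float:
    """||x - x_ref|| / ||x_ref|| in the Euclidean norm."""
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_ref)))
    den = math.sqrt(sum(b ** 2 for b in x_ref))
    return num / den


def overshoot(trajectory, target: float) -> float:
    """Largest excursion past the target, measured away from the initial value."""
    direction = 1.0 if target >= trajectory[0] else -1.0
    return max(0.0, max(direction * (y - target) for y in trajectory))


def steady_state_variance(trajectory, tail: int) -> float:
    """Population variance of the last `tail` samples of a scalar trajectory."""
    if tail < 1 or tail > len(trajectory):
        raise ValueError("tail must be between 1 and len(trajectory)")
    return statistics.pvariance(trajectory[-tail:])


if __name__ == "__main__":
    traj = [0.0, 5.0, 12.0, 10.5, 9.5, 10.0, 10.0]
    print("relative error:", relative_error([90.0, 110.0], [100.0, 100.0]))
    print("overshoot:", overshoot(traj, 10.0))
    print("steady-state variance:", steady_state_variance(traj, 4))
```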

Comparison Methods

Algebraic system (orange line): Assumes $H$ is known, directly samples $\mathbb{1}_{u_k+\delta v_k \succ u_{k-1}+\delta v_{k-1}}$
Noiseless user model: $\mathbb{1} = \text{sign}(\Phi(x_k, u_{k-1}+\delta v_{k-1}) - \Phi(x_{k+1}, u_k+\delta v_k))$
Proposed method (blue line): Complete Algorithm 1

Implementation Details

Step size: $\eta = 0.1$
Smoothing parameter: $\delta = 0.5$
Number of simulations: 20 independent runs
Statistical display: Solid line represents mean, shaded region represents one standard deviation
Initial conditions: $u_0$ randomly initialized

Experimental Results

Main Results

Experiment 1: Quadratic problem

System Parameter	Convergence Speed	Steady-state Accuracy	Overshoot	Steady-state Variance
c=0.1 (fast)	Fast	High	Small	Small
c=0.7 (slow)	Slow	Comparable	Large	Large

Key findings:

Steady-state performance: Proposed method (blue line) and algebraic system (orange line) achieve comparable accuracy levels at steady-state
Transient effects: For slower systems (c=0.7), proposed method exhibits larger overshoot and higher steady-state variance
Theory validation: Experimental results align with theoretical predictions—system decay rate μ affects performance

Experiment 2: Thermal comfort optimization

Convergence: Algorithm successfully tracks optimal temperature (black horizontal line)
Noise impact:
- Noisy feedback (blue line): Slower convergence with oscillations
- Noiseless feedback (orange line): Faster convergence, more stable
Practicality: Through careful tuning of η and δ, controller effectively tracks optimal point without significant overshoot

Experimental Findings

Importance of system dynamics:
- System transient significantly affects algorithm performance
- Fast-stabilizing systems (small μ) achieve better tracking performance
- Validates theoretical results regarding μ in Lemma 1 and Theorem 1
Parameter tradeoffs:
- δ: Smaller δ reduces the exploration-induced error but increases the update gain η/(2δ), which enters the Lemma 1 bound through the $(\frac{\eta}{2\delta})^2$ term
- η: Must balance convergence speed and stability
- Exploration-exploitation tradeoff exists
User model impact:
- Bradley-Terry model (probabilistic feedback) introduces additional noise
- Deterministic feedback significantly improves performance
- Motivates future research on alternative user models
Practical application potential:
- Thermal comfort optimization demonstrates practical application potential for learning human utility
- Single-point evaluation scheme suits online implementation
- Algorithm is robust to initial conditions

Related Work

Online Feedback Optimization (OFO)

Applications: Grid control [5] and robot coordination [6]
Theoretical guarantees: First-order [7] and zeroth-order [8] formulations
Limitations: Requires precise utility values or gradient information

Offline Preference Optimization

Finite action spaces:

Optimality concepts: Copeland winner [10], Borda winner [11]
Algorithms: Random exploration [12], greedy search [13]

Continuous action spaces:

GP modeling: Model latent utility with Gaussian processes
Heuristic policies: Balance exploration and exploitation [14, 15]
Regret guarantees: When utility lies in an RKHS [16, 17]

Gradient estimation:

Existing methods [18, 19]: Require two evaluations per step
This paper's method: Requires only one evaluation, better suited for online scenarios

Differentiation of This Work

First closed-loop guarantee: Real-time preference optimization considering system transients
Single-point evaluation: One utility evaluation per step instead of two, which lowers the query burden on the user
Theoretical completeness: Provides both stability and optimality guarantees
Practicality: Suitable for real engineering systems

Conclusions and Discussion

Main Conclusions

Theoretical contributions:
- Develops first human-aware controller leveraging preference feedback with closed-loop guarantees
- Explicitly quantifies transient effects on performance
- Establishes theoretical guarantees for stability and optimality
Method advantages:
- Requires only one utility evaluation per step
- No need for precise system model
- Handles time-varying utility and external disturbances
Experimental validation:
- Theoretical results validated in numerical experiments
- Demonstrates practical application potential in thermal comfort optimization

Limitations

Assumption constraints:
- Strong convexity assumption may be restrictive in some applications
- The Bradley-Terry model assumes completely rational human behavior, but humans are not always rational [9]
- Requires exponentially stable systems
Steady-state error:
- An $O(\mu, \delta)$ steady-state error remains
- Cannot be completely eliminated, only reduced through parameter tuning
- Performance may degrade for very slow systems
User burden:
- Requires user feedback at each time step
- May cause user fatigue in practical applications
- Does not consider user feedback delays
Theory-practice gap:
- Theoretical analysis for deterministic feedback model not yet established
- Experiments show noiseless model performs better, but lacks theoretical support
Computational complexity:
- Scalability to large-scale systems not discussed
- Random exploration may be inefficient in high-dimensional spaces

Future Directions

Explicitly proposed directions:

Extend theoretical framework to alternative user models (e.g., noiseless models)
Practical applications: Product design, chemical selection, etc.
Relax assumptions: Non-convex utility functions, unstable systems
Multi-agent scenarios: Preference aggregation from multiple users

Potential research directions:

Adaptive parameter adjustment: Online tuning of η and δ
User fatigue modeling: Reduce feedback frequency
Delayed feedback: Handle user response delays
High-dimensional optimization: More efficient exploration strategies

In-depth Evaluation

Strengths

Theoretical rigor:

Complete theoretical framework: Full analysis chain from stability (Lemma 1) to optimality (Theorem 1)
Explicit error bounds: Clearly quantifies approximation errors (Lemma 4) rather than just asymptotic results
Mild assumptions: While strong convexity is assumed, other assumptions (Lipschitz continuity) are common in practice
Complete proofs: All major results have detailed proofs (appendix)

Method innovation:

Pioneering work: First to combine preference feedback with closed-loop control, filling research gap
Single-point evaluation: Reduces evaluation count by 50% compared to existing methods, significantly improving practicality
Unified framework: Unifies stability and optimality analysis in single framework
Probabilistic interpretation: Converts binary feedback to probabilistic gradient descent, providing intuitive understanding

Experimental design:

Progressive validation: From simple quadratic problems to practical thermal comfort problems
Parameter sensitivity analysis: Tests system dynamics impact through different c values
Statistical reliability: 20 independent runs providing mean and variance
Practical relevance: Thermal comfort optimization is real-world application scenario

Writing quality:

Clear structure: Logical progression from problem definition through theory to experiments
Consistent notation: Mathematical symbols are used consistently and follow standard conventions
Intuitive explanations: Multiple Remarks provide intuitive explanations beyond technical details

Weaknesses

Theoretical limitations:

Strong convexity assumption: Limits applicability; many practical utility functions (e.g., PPD) are non-convex
Asymptotic results: Theorem 1 provides bounds dependent on arbitrary fixed k', without explicit finite-time convergence rates
Constant dependence: Constants in $O(\mu, \delta)$ may be large, theoretical bounds potentially conservative
Missing deterministic model: Experiments show noiseless model performs better, but lacks theoretical analysis

Experimental insufficiencies:

Limited comparison methods:
- No comparison with other preference learning methods (e.g., GP-based methods [14, 15])
- No comparison with traditional adaptive control methods
- Only compares with algebraic system and noiseless model
Parameter tuning:
- No systematic study of η and δ selection strategies
- No parameter selection guidelines provided
- Experimental parameters appear manually tuned
Scale limitations:
- Only tests low-dimensional systems (2D and 13D)
- Scalability to high dimensions not verified
Missing real user testing:
- All experiments based on simulated user models
- No real human subject experiments
- Cannot verify actual effectiveness of Bradley-Terry model

Method limitations:

Exploration efficiency: Uniform sphere sampling may be inefficient in high dimensions
Cold start problem: Algorithm requires initial u₀, selection strategy not discussed
Robustness: No analysis of robustness to model mismatch, measurement noise
Computational cost: Per-step computational complexity not discussed

Practical considerations:

User burden: Requires user feedback at each step, potentially causing fatigue
Feedback quality: Assumes users can provide accurate preferences, but actual inconsistency possible
Safety constraints: Does not consider state and input constraints, important in real systems
Multi-objective optimization: Only considers single utility function

Impact

Contributions to field:

Pioneering work: Opens new research direction in real-time preference optimization
Theoretical foundation: Provides theoretical framework and analysis tools for subsequent research
Interdisciplinary bridge: Connects control theory, optimization, and human-computer interaction
Application potential: Offers new insights for human-aware system design

Expected impact:

Short-term: Likely to inspire more research on preference feedback control
Medium-term: Potential applications in building control, personalized recommendations
Long-term: May influence design paradigm for human-machine interaction systems

Limitations:

Strong assumptions may limit practical applications
Lack of real user experiments may affect credibility
Requires more engineering work for actual deployment

Applicable Scenarios

Ideal application scenarios:

Building control:
- Personalized temperature adjustment
- Lighting control
- Air quality management
- Advantage: Relatively slow system dynamics, users can provide continuous feedback
Personalized recommendations:
- Product recommendations
- Content recommendations
- Advantage: Users accustomed to providing comparative feedback
Healthcare:
- Personalized treatment plan adjustment
- Rehabilitation training intensity adjustment
- Advantage: Emphasizes individual differences
Human-machine collaboration:
- Robot-assisted tasks
- Autonomous vehicle personalization
- Advantage: Requires real-time adaptation to user preferences

Unsuitable scenarios:

Fast-dynamics systems: High-frequency trading, flight control (users cannot provide timely feedback)
High-dimensional complex systems: Low exploration efficiency
Strict safety constraints: Does not handle constraints, potentially unsafe
Multi-objective conflicts: Only considers single utility
Non-convex optimization: Theoretical guarantees fail

Improvement suggestions:

Combine with active learning to reduce user feedback frequency
Introduce safety filters to handle constraints
Extend to multi-objective scenarios
Develop adaptive parameter adjustment strategies

References

Key references:

[8] Z. He et al., 2023 - Model-free nonlinear feedback optimization
- Main theoretical foundation of this paper
- Provides single-point residual estimation concept
[18] Y. Yue & T. Joachims, 2009 - Interactively optimizing information retrieval
- Classical work on preference feedback gradient estimation
- This paper improves upon its two-evaluation requirement
[16] W. Xu et al., 2024 - Principled preferential Bayesian optimization
- Recent advances in preference Bayesian optimization
- Provides comparison baseline for GP-based methods
[27] Y. Lian et al., 2023 - Adaptive robust data-driven building control
- Practical system model for building control
- Provides realistic scenario for experiments
[9] D. Kahneman & A. Tversky, 2013 - Prospect theory
- Non-rational human decision-making behavior
- Highlights limitations of user model assumptions

Overall Assessment: This is a rigorous and innovative paper that combines preference learning with closed-loop control and provides a new theoretical framework for designing human-machine interaction systems. Its main contributions are the first stability and optimality guarantees for real-time preference optimization and a practical single-point evaluation scheme. Its main weaknesses are the strong convexity assumption, the absence of experiments with real users, and the limited set of comparison methods. Future work should relax the assumptions, conduct real user studies, and extend the method to more complex practical scenarios. The paper merits in-depth study by researchers working on human-machine interaction control, preference learning, or online optimization.