Human-in-the-loop: Real-time Preference Optimization

Wang, Xu, Jones
Optimization with preference feedback is an active research area with many applications in engineering systems where humans play a central role, such as building control and autonomous vehicles. While most existing studies focus on optimizing a static user utility, few have investigated its closed-loop behavior that accounts for system transients. In this work, we propose an online feedback optimization controller that can optimize user utility using pairwise comparison feedback with both optimality and closed-loop stability guarantees. By adding a random exploration signal, the controller estimates the gradient based on the binary utility comparison feedback between two consecutive time steps. We analyze its closed-loop behavior when interacting with a nonlinear plant and show that, under mild assumptions, the controller converges to the optimal point without inducing instability. Theoretical findings are further validated through numerical experiments.

Basic Information

  • Paper ID: 2506.02225
  • Title: Human-in-the-loop: Real-time Preference Optimization
  • Authors: Wenbin Wang, Wenjie Xu, Colin N. Jones (Automatic Control Laboratory, EPFL)
  • Classification: math.OC (Optimization and Control)
  • Publication Date: arXiv preprint, November 3, 2025 (v2)
  • Paper Link: https://arxiv.org/abs/2506.02225

Abstract

This paper investigates optimization problems with preference feedback, which have broad applications in human-centric engineering systems such as building control and autonomous driving. Existing research primarily focuses on static user utility optimization with limited consideration of closed-loop transient behavior. The paper proposes an online feedback optimization controller that leverages pairwise comparison feedback to optimize user utility while providing optimality and closed-loop stability guarantees. By incorporating random exploration signals, the controller estimates gradients based on binary utility comparisons between consecutive time steps. The authors analyze the closed-loop behavior when the controller interacts with nonlinear systems and prove that under mild assumptions, the controller converges to the optimal point without inducing instability. Theoretical findings are validated through numerical experiments.

Research Background and Motivation

Problems to be Addressed

  1. Human-machine interactive control: How to design human-aware controllers that optimize user utility in real-time, enabling systems to adapt based on user preferences
  2. Real-time optimization with preference feedback: How to perform online optimization using binary preference comparisons rather than absolute utility values
  3. Closed-loop stability guarantees: How to ensure the optimization process does not destabilize the system while considering transient behavior

Problem Significance

  • Individual differences: Traditional controllers track predefined reference points based on large-scale population models (e.g., indoor temperature in building control), introducing bias and suboptimal performance due to inability to account for individual variations
  • Time-varying utility: Without real-time human feedback, controllers cannot respond to time-varying utility and lack robustness to external disturbances
  • Human cognitive characteristics: Humans are better at relative comparisons than at absolute assessments, so preference feedback naturally takes the form of pairwise comparisons

Limitations of Existing Methods

  1. Online feedback optimization (OFO): Existing OFO methods (applied to, e.g., grid control and robot coordination) require precise utility values or gradient information, which makes direct application to human preference feedback difficult
  2. Offline preference optimization:
    • Most research considers static problems, ignoring system transient behavior
    • Existing gradient estimation methods (e.g., [18], [19]) require two function evaluations per time step, which makes them unsuitable for online implementation
    • Lack of closed-loop stability analysis
  3. Stability quantification difficulty: The binary nature of preference feedback makes overall dynamics highly nonlinear, complicating stability analysis
  4. Limited user knowledge: Users typically have limited understanding of system dynamics; directly following their preferences may cause instability

Research Motivation

Inspired by the recently proposed model-free OFO with single-point residual estimation [8], the authors aim to provide the first treatment of real-time preference optimization with closed-loop guarantees.

Core Contributions

  1. Novel OFO controller: Proposes the first online feedback optimization controller utilizing binary preference feedback to optimize user utility while ensuring closed-loop stability
  2. Single-point evaluation scheme: Employs a random exploration scheme requiring only one utility evaluation per time step (rather than two), better suited for online implementation
  3. Theoretical guarantees:
    • Proves closed-loop system stability (Lemma 1: bounded expected Lyapunov function)
    • Establishes optimality guarantees (Theorem 1: the expected squared distance to the optimum converges to an O(μ, δ) neighborhood)
    • Quantifies transient effects on performance
  4. First closed-loop guarantee: To the authors' knowledge, this is the first work providing closed-loop guarantees for real-time preference optimization
  5. Numerical validation: Validates theoretical results through thermal comfort optimization problems

Methodology Details

Task Definition

System model: Consider exponentially stable systems $x_{k+1} = f(x_k, u_k)$, where $x \in \mathbb{R}^{n_x}$ is the system state, $u \in \mathbb{R}^{n_u}$ is the control input, and there exists a unique steady-state input-state mapping $h: \mathbb{R}^{n_u} \rightarrow \mathbb{R}^{n_x}$.

Optimization objective: Optimize user utility at steady state, $\min_{x,u} \Phi(x, u) \quad \text{s.t. } x = h(u)$, which is equivalent to the unconstrained problem $\min_u \tilde{\Phi}(u)$, where $\tilde{\Phi}(u) = \Phi(h(u), u)$.

Preference feedback model (Bradley-Terry model): $P(\mathbb{1}_{u_1 \succ u_2} = 1) = \sigma(\tilde{\Phi}(u_2) - \tilde{\Phi}(u_1))$, where $\sigma(t) = \frac{1}{1+e^{-t}}$ is the sigmoid function.
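The Bradley-Terry feedback above can be simulated in a few lines. This is a minimal sketch, assuming the utility values are available as plain floats and that lower values are better (the problem is a minimization); the function names are hypothetical and not taken from the paper.

```python
import math
import random


def sigmoid(t: float) -> float:
    """Numerically stable logistic function sigma(t) = 1 / (1 + exp(-t))."""
    if t >= 0.0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def prefer_probability(phi_u1: float, phi_u2: float) -> float:
    """P(u1 is preferred over u2) = sigma(phi(u2) - phi(u1)) for a cost phi."""
    return sigmoid(phi_u2 - phi_u1)


def sample_preference(phi_u1: float, phi_u2: float, rng: random.Random) -> int:
    """Draw the binary indicator: 1 if u1 is reported as preferred, else 0."""
    return 1 if rng.random() < prefer_probability(phi_u1, phi_u2) else 0
```

Two inputs with equal utility are preferred with probability 0.5 each, and the two orderings of any pair have probabilities that sum to one.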

Key assumptions:

  1. Input-state mapping $h$ is Lipschitz continuous
  2. Utility function $\Phi(x,u)$ is Lipschitz continuous in $x$
  3. $\tilde{\Phi}(u)$ is differentiable, Lipschitz continuous, smooth, and strongly convex

Model Architecture

Algorithm flow (Algorithm 1):

Input: step size η, smoothing parameter δ, initial input u₀, time steps T
for k = 1, ..., T-1:
    1. Add random exploration: xₖ₊₁ = f(xₖ, uₖ + δvₖ)
       where vₖ is uniformly sampled from (nᵤ-1)-dimensional unit sphere
    
    2. Collect preference feedback:
       Ask user to compare Φ(xₖ₊₁, uₖ + δvₖ) and Φ(xₖ, uₖ₋₁ + δvₖ₋₁)
       Sample 𝟙_{(xₖ₊₁,uₖ+δvₖ)≻(xₖ,uₖ₋₁+δvₖ₋₁)}
    
    3. Update control input:
       uₖ₊₁ = uₖ + (η/2δ)𝟙_{(xₖ₊₁,uₖ+δvₖ)≻(xₖ,uₖ₋₁+δvₖ₋₁)}vₖ
end for
Output: u_T

Closed-loop system: $x_{k+1} = f(x_k, u_k + \delta v_k)$, $u_{k+1} = u_k + \frac{\eta}{2\delta}\mathbb{1}_{(x_{k+1},u_k+\delta v_k)\succ(x_k,u_{k-1}+\delta v_{k-1})}v_k$
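The loop of Algorithm 1 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the plant `f` and utility `phi` are user-supplied callables, the feedback is encoded as ±1 (following the sign form of the noiseless user model; the summary does not state the encoding), and the first comparison point is created by one initial perturbed step. All function names are hypothetical.

```python
import math
import random
from typing import Callable, List, Tuple

Vector = List[float]


def sigmoid(t: float) -> float:
    """Numerically stable logistic function."""
    if t >= 0.0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def sample_unit_sphere(n: int, rng: random.Random) -> Vector:
    """Uniform direction on the unit sphere in R^n (normalized Gaussian vector)."""
    g = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(gi * gi for gi in g))
    return [gi / norm for gi in g]


def preference_ofo(
    f: Callable[[Vector, Vector], Vector],
    phi: Callable[[Vector, Vector], float],
    x0: Vector,
    u0: Vector,
    eta: float,
    delta: float,
    T: int,
    rng: random.Random,
    noiseless: bool = False,
) -> Tuple[Vector, Vector]:
    """Run the comparison-based update and return the final (input, state)."""
    u = list(u0)
    # Assumption: one initial perturbed step provides the first comparison point.
    v = sample_unit_sphere(len(u), rng)
    w = [ui + delta * vi for ui, vi in zip(u, v)]
    x = f(list(x0), w)
    phi_prev = phi(x, w)
    step = eta / (2.0 * delta)
    for _ in range(1, T):
        v = sample_unit_sphere(len(u), rng)
        w = [ui + delta * vi for ui, vi in zip(u, v)]  # perturbed input u_k + delta * v_k
        x = f(x, w)                                    # plant step gives x_{k+1}
        phi_new = phi(x, w)
        gain = phi_prev - phi_new                      # > 0 when the new point has lower cost
        if noiseless:
            s = 1.0 if gain > 0.0 else -1.0            # sign-type user model
        else:
            s = 1.0 if rng.random() < sigmoid(gain) else -1.0  # Bradley-Terry sample
        u = [ui + step * s * vi for ui, vi in zip(u, v)]
        phi_prev = phi_new
    return u, x
```

The simulated user only ever compares the current operating point with the previous one, so each time step needs a single new utility evaluation. The true utility values appear here only because the user is simulated; the controller itself uses nothing but the binary outcome `s`.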

Technical Innovations

  1. Single-point residual estimation:
    • Leverages $x_{k+1}$ to approximate $h(u_k + \delta v_k)$, avoiding the need for a precise system model
    • Requires only one utility evaluation per time step, versus two in traditional methods
    • Based on comparisons between consecutive time steps, naturally incorporating temporal structure
  2. Probabilistic gradient descent interpretation:
    • Interprets the update rule as gradient descent on the probability function $p_{u'}(u) = P(\mathbb{1}_{u \succ u'} = 1)$
    • Proves that minimizing $p_{u'}(u)$ is equivalent to minimizing $\tilde{\Phi}(u)$ (Lemma 3)
    • Rewrites the update as $u_{k+1} = u_k - \eta(\nabla p_{u_k}(u_k) + e_k)$
    • where the error term $e_k$ arises from approximating $h(u_k + \delta v_k)$ with $x_{k+1}$ and from stochastic gradient estimation
  3. Error analysis framework:
    • Explicitly bounds the error $e_k$ (Lemma 4): $\|E[e_k|F_k]\| \leq \sqrt{R_1 V(x_{k-1}, u_{k-1}+\delta v_{k-1}) + R_2}$
    • where $R_1 = O(\mu)$, $R_2 = O(\mu, \delta^2)$, and $\mu$ is the system decay rate
    • Faster system stabilization (smaller μ) yields smaller approximation error
  4. Unified stability and optimality analysis:
    • Analyzes stability via Lyapunov function (Lemma 1)
    • Analyzes optimality via the expected squared distance $E[\|u_k - u^*\|^2]$ (Theorem 1)
    • Connects both through system transient behavior

Theoretical Results

Stability (Lemma 1): $E[V(x_k, u_k+\delta v_k)] \leq \mu^k E[V(x_0, u_0+\delta v_0)] + \frac{a_1}{1-\mu}\left(2\delta^2 + \eta + \left(\frac{\eta}{2\delta}\right)^2\right)$, where $\mu = \frac{2\alpha_2}{\alpha_1}\left(1-\frac{\alpha_3}{\alpha_2}\right) < 1$.

Optimality (Theorem 1): $E[\|u_k - u^*\|^2] \leq \left(\frac{1+\rho}{2}\right)^{k-k'} E[\|u_{k'} - u^*\|^2] + O(\mu, \mu^{k'}, \delta)$, where $\rho = 1 - 2\sigma'(0)m\eta$.
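The constant $\rho$ simplifies once the sigmoid derivative is evaluated. The short derivation below assumes that $m$ denotes the strong-convexity constant of $\tilde{\Phi}$, which this summary does not define explicitly.

```latex
\begin{align*}
\sigma'(t) &= \sigma(t)\bigl(1 - \sigma(t)\bigr), &
\sigma'(0) &= \tfrac{1}{2}\cdot\tfrac{1}{2} = \tfrac{1}{4}, \\
\rho &= 1 - 2\sigma'(0)\,m\,\eta = 1 - \frac{m\eta}{2}, &
\frac{1+\rho}{2} &= 1 - \frac{m\eta}{4}.
\end{align*}
```

Under that reading, the factor $\frac{1+\rho}{2}$ is strictly below 1 for any $m\eta > 0$, and a larger step size $\eta$ gives a faster contraction of the first term. Any further step-size conditions imposed by the theorem are not reproduced in this summary.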

Key insights:

  • Steady-state error characterized by $O(\mu, \delta)$
  • Faster system stabilization (smaller μ) leads to better performance
  • Exploration-exploitation tradeoff exists (choice of δ)
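The role of δ can be made concrete with the residual term of the Lemma 1 bound, 2δ² + η + (η/(2δ))², which grows both when δ is large and when δ is small. The sketch below evaluates that term and the δ that minimizes it. It only minimizes this one term of the bound, which is an illustration and not a tuning rule from the paper; the function names are hypothetical.

```python
import math


def lemma1_bracket(eta: float, delta: float) -> float:
    """Residual term 2*delta^2 + eta + (eta / (2*delta))^2 from the Lemma 1 bound."""
    return 2.0 * delta**2 + eta + (eta / (2.0 * delta)) ** 2


def bracket_minimizing_delta(eta: float) -> float:
    """Stationary point of the bracket in delta: delta^4 = eta^2 / 8."""
    return (eta**2 / 8.0) ** 0.25


eta = 0.1
print(lemma1_bracket(eta, 0.5))                      # value at eta = 0.1, delta = 0.5
best = bracket_minimizing_delta(eta)
print(best, lemma1_bracket(eta, best))               # minimizer and minimum of the bracket
```

For η = 0.1 and δ = 0.5 the bracket is 0.5 + 0.1 + 0.01 = 0.61. At the minimizing δ the bracket equals (1 + √2)η, so the remaining term scales linearly with the step size.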

Experimental Setup

Datasets/System Models

Experiment 1: Quadratic problem

  • System: LTI system $x_{k+1} = Ax_k + Bu_k$
  • Matrices: $A = \begin{bmatrix} c & 1 \\ 0 & c \end{bmatrix}$, $B$ is the identity matrix
  • Parameter variation: $c \in \{0.1, 0.7\}$ to test different decay rates
  • Optimization objective: $\min (x-x_{ref})^\top(x-x_{ref})$, where $x_{ref} = [100, 100]^\top$
  • Steady-state mapping: $H = (I-A)^{-1}B$
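For this 2×2 setup the steady-state map and the optimal input have a closed form, which the sketch below evaluates for both values of c. It assumes the matrices exactly as listed above; the function names are hypothetical. Since A is upper triangular, both of its eigenvalues equal c, so c acts as the decay rate of the plant.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def steady_state_map(c: float) -> Matrix:
    """H = (I - A)^{-1} B for A = [[c, 1], [0, c]] and B = I (closed form)."""
    d = 1.0 - c
    return [[1.0 / d, 1.0 / d**2], [0.0, 1.0 / d]]


def optimal_input(c: float, x_ref: Vector) -> Vector:
    """u* = (I - A) x_ref, the input whose steady state H u* equals x_ref."""
    return [(1.0 - c) * x_ref[0] - x_ref[1], (1.0 - c) * x_ref[1]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


x_ref = [100.0, 100.0]
for c in (0.1, 0.7):
    u_star = optimal_input(c, x_ref)
    # The steady state reached under u* should reproduce x_ref.
    print(c, u_star, matvec(steady_state_map(c), u_star))
```

With these matrices the optimal inputs are approximately [-10, 90] for c = 0.1 and [-70, 30] for c = 0.7.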

Experiment 2: Thermal comfort optimization

  • System: 13-state building LTI model [27]
  • Utility function: PMV (Predicted Mean Vote) model [3]
  • Evaluation metric: PPD (Predicted Percentage of Dissatisfied) index
  • Objective: Identify indoor temperature minimizing PPD
  • User settings: Typing activity, wearing athletic pants, T-shirt, and shoes
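The PPD index is a fixed function of PMV in the standard Fanger / ISO 7730 thermal comfort model. The sketch below shows only that mapping. Computing PMV itself from temperature, clothing, and activity is not covered here, and the summary does not say how the paper implements either quantity, so the function name and its use as the objective are assumptions.

```python
import math


def ppd_from_pmv(pmv: float) -> float:
    """Predicted Percentage of Dissatisfied (%) from PMV (Fanger / ISO 7730 relation)."""
    return 100.0 - 95.0 * math.exp(-0.03353 * pmv**4 - 0.2179 * pmv**2)


for pmv in (-1.0, -0.5, 0.0, 0.5, 1.0):
    print(pmv, round(ppd_from_pmv(pmv), 1))
```

PPD is symmetric in PMV and reaches its minimum of 5% at PMV = 0, so minimizing PPD amounts to finding the temperature at which the predicted vote is neutral.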

Evaluation Metrics

  1. Relative error: $\|x_k - x_{ref}\|/\|x_{ref}\|$ (log scale)
  2. Temperature tracking: Difference between actual and optimal temperature
  3. Steady-state variance: Algorithm fluctuation at steady-state
  4. Overshoot: Maximum deviation during convergence
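Two of these metrics are simple to compute from a simulated trajectory. The sketch below is a minimal illustration: the relative error follows the formula above, while the length of the tail window used for the steady-state variance is a hypothetical choice that the summary does not specify.

```python
import math
import statistics
from typing import Sequence


def relative_error(x: Sequence[float], x_ref: Sequence[float]) -> float:
    """||x - x_ref|| / ||x_ref|| in the Euclidean norm."""
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_ref)))
    den = math.sqrt(sum(b * b for b in x_ref))
    return num / den


def tail_variance(values: Sequence[float], tail: int) -> float:
    """Population variance over the last `tail` samples (assumed steady-state window)."""
    return statistics.pvariance(values[-tail:])
```

For plotting on a log scale, a state that matches the reference exactly has relative error 0 and needs to be clipped or skipped.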

Comparison Methods

  1. Algebraic system (orange line): Assumes $H$ is known and directly samples $\mathbb{1}_{u_k+\delta v_k \succ u_{k-1}+\delta v_{k-1}}$
  2. Noiseless user model: $\mathbb{1} = \text{sign}(\Phi(x_k, u_{k-1}+\delta v_{k-1}) - \Phi(x_{k+1}, u_k+\delta v_k))$
  3. Proposed method (blue line): Complete Algorithm 1

Implementation Details

  • Step size: $\eta = 0.1$
  • Smoothing parameter: $\delta = 0.5$
  • Number of simulations: 20 independent runs
  • Statistical display: Solid line represents mean, shaded region represents one standard deviation
  • Initial conditions: $u_0$ randomly initialized

Experimental Results

Main Results

Experiment 1: Quadratic problem

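| System Parameter | Convergence Speed | Steady-state Accuracy | Overshoot | Steady-state Variance |
|------------------|-------------------|-----------------------|-----------|-----------------------|
| c=0.1 (fast)     | Fast              | High                  | Small     | Small                 |
| c=0.7 (slow)     | Slow              | Comparable            | Large     | Large                 |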

Key findings:

  1. Steady-state performance: Proposed method (blue line) and algebraic system (orange line) achieve comparable accuracy levels at steady-state
  2. Transient effects: For slower systems (c=0.7), proposed method exhibits larger overshoot and higher steady-state variance
  3. Theory validation: Experimental results align with theoretical predictions: the system decay rate μ affects performance

Experiment 2: Thermal comfort optimization

  • Convergence: Algorithm successfully tracks optimal temperature (black horizontal line)
  • Noise impact:
    • Noisy feedback (blue line): Slower convergence with oscillations
    • Noiseless feedback (orange line): Faster convergence, more stable
  • Practicality: Through careful tuning of η and δ, controller effectively tracks optimal point without significant overshoot

Experimental Findings

  1. Importance of system dynamics:
    • System transient significantly affects algorithm performance
    • Fast-stabilizing systems (small μ) achieve better tracking performance
    • Validates theoretical results regarding μ in Lemma 1 and Theorem 1
  2. Parameter tradeoffs:
    • δ: Smaller δ reduces exploration noise but enlarges the update step η/(2δ), in line with the δ² and (η/(2δ))² terms in Lemma 1
    • η: Must balance convergence speed and stability
    • Exploration-exploitation tradeoff exists
  3. User model impact:
    • Bradley-Terry model (probabilistic feedback) introduces additional noise
    • Deterministic feedback significantly improves performance
    • Motivates future research on alternative user models
  4. Practical application potential:
    • Thermal comfort optimization demonstrates practical application potential for learning human utility
    • Single-point evaluation scheme suits online implementation
    • Algorithm is robust to initial conditions

Related Work

Online Feedback Optimization (OFO)

  • Applications: Grid control [5] and robot coordination [6]
  • Theoretical guarantees: First-order [7] and zeroth-order [8] formulations
  • Limitations: Requires precise utility values or gradient information

Offline Preference Optimization

Finite action spaces:

  • Optimality concepts: Copeland winner [10], Borda winner [11]
  • Algorithms: Random exploration [12], greedy search [13]

Continuous action spaces:

  • GP modeling: Model latent utility with Gaussian processes
  • Heuristic policies: Balance exploration and exploitation [14], [15]
  • Regret guarantees: When utility lies in an RKHS [16], [17]

Gradient estimation:

  • Existing methods [18], [19]: Require two evaluations per step
  • This paper's method: Requires only one evaluation, better suited for online scenarios

Differentiation of This Work

  1. First closed-loop guarantee: Real-time preference optimization considering system transients
  2. Single-point evaluation: One utility evaluation per time step instead of two
  3. Theoretical completeness: Provides both stability and optimality guarantees
  4. Practicality: Suitable for real engineering systems

Conclusions and Discussion

Main Conclusions

  1. Theoretical contributions:
    • Develops first human-aware controller leveraging preference feedback with closed-loop guarantees
    • Explicitly quantifies transient effects on performance
    • Establishes theoretical guarantees for stability and optimality
  2. Method advantages:
    • Requires only one utility evaluation per step
    • No need for precise system model
    • Handles time-varying utility and external disturbances
  3. Experimental validation:
    • Theoretical results validated in numerical experiments
    • Demonstrates practical application potential in thermal comfort optimization

Limitations

  1. Assumption constraints:
    • Strong convexity assumption may be restrictive in some applications
    • Bradley-Terry model assumes completely rational human behavior, but humans are not always rational [9]
    • Requires exponentially stable systems
  2. Steady-state error:
    • An $O(\mu, \delta)$ steady-state error remains
    • Cannot be completely eliminated, only reduced through parameter tuning
    • Performance may degrade for very slow systems
  3. User burden:
    • Requires user feedback at each time step
    • May cause user fatigue in practical applications
    • Does not consider user feedback delays
  4. Theory-practice gap:
    • Theoretical analysis for deterministic feedback model not yet established
    • Experiments show noiseless model performs better, but lacks theoretical support
  5. Computational complexity:
    • Scalability to large-scale systems not discussed
    • Random exploration may be inefficient in high-dimensional spaces

Future Directions

Explicitly proposed directions:

  1. Extend theoretical framework to alternative user models (e.g., noiseless models)
  2. Practical applications: Product design, chemical selection, etc.
  3. Relax assumptions: Non-convex utility functions, unstable systems
  4. Multi-agent scenarios: Preference aggregation from multiple users

Potential research directions:

  5. Adaptive parameter adjustment: Online tuning of η and δ
  6. User fatigue modeling: Reduce feedback frequency
  7. Delayed feedback: Handle user response delays
  8. High-dimensional optimization: More efficient exploration strategies

In-depth Evaluation

Strengths

Theoretical rigor:

  1. Complete theoretical framework: Full analysis chain from stability (Lemma 1) to optimality (Theorem 1)
  2. Explicit error bounds: Clearly quantifies approximation errors (Lemma 4) rather than just asymptotic results
  3. Mild assumptions: While strong convexity is assumed, other assumptions (Lipschitz continuity) are common in practice
  4. Complete proofs: All major results have detailed proofs (appendix)

Method innovation:

  1. Pioneering work: First to combine preference feedback with closed-loop control, filling research gap
  2. Single-point evaluation: Reduces evaluation count by 50% compared to existing methods, significantly improving practicality
  3. Unified framework: Unifies stability and optimality analysis in single framework
  4. Probabilistic interpretation: Converts binary feedback to probabilistic gradient descent, providing intuitive understanding

Experimental design:

  1. Progressive validation: From simple quadratic problems to practical thermal comfort problems
  2. Parameter sensitivity analysis: Tests system dynamics impact through different c values
  3. Statistical reliability: 20 independent runs providing mean and variance
  4. Practical relevance: Thermal comfort optimization is real-world application scenario

Writing quality:

  1. Clear structure: Logical progression from problem definition through theory to experiments
  2. Consistent notation: Mathematical symbols are used consistently and follow standard conventions
  3. Intuitive explanations: Multiple Remarks provide intuitive explanations beyond technical details

Weaknesses

Theoretical limitations:

  1. Strong convexity assumption: Limits applicability; many practical utility functions (e.g., PPD) are non-convex
  2. Asymptotic results: Theorem 1 provides bounds dependent on arbitrary fixed k', without explicit finite-time convergence rates
  3. Constant dependence: Constants in $O(\mu, \delta)$ may be large, so the theoretical bounds are potentially conservative
  4. Missing deterministic model: Experiments show noiseless model performs better, but lacks theoretical analysis

Experimental insufficiencies:

  1. Limited comparison methods:
    • No comparison with other preference learning methods (e.g., GP-based methods [14], [15])
    • No comparison with traditional adaptive control methods
    • Only compares with algebraic system and noiseless model
  2. Parameter tuning:
    • No systematic study of η and δ selection strategies
    • No parameter selection guidelines provided
    • Experimental parameters appear manually tuned
  3. Scale limitations:
    • Only tests low-dimensional systems (2D and 13D)
    • Scalability to high dimensions not verified
  4. Missing real user testing:
    • All experiments based on simulated user models
    • No real human subject experiments
    • Cannot verify actual effectiveness of Bradley-Terry model

Method limitations:

  1. Exploration efficiency: Uniform sphere sampling may be inefficient in high dimensions
  2. Cold start problem: Algorithm requires initial u₀, selection strategy not discussed
  3. Robustness: No analysis of robustness to model mismatch, measurement noise
  4. Computational cost: Per-step computational complexity not discussed

Practical considerations:

  1. User burden: Requires user feedback at each step, potentially causing fatigue
  2. Feedback quality: Assumes users can provide accurate preferences, but actual inconsistency possible
  3. Safety constraints: Does not consider state and input constraints, important in real systems
  4. Multi-objective optimization: Only considers single utility function

Impact

Contributions to field:

  1. Pioneering work: Opens new research direction in real-time preference optimization
  2. Theoretical foundation: Provides theoretical framework and analysis tools for subsequent research
  3. Interdisciplinary bridge: Connects control theory, optimization, and human-computer interaction
  4. Application potential: Offers new insights for human-aware system design

Expected impact:

  • Short-term: Likely to inspire more research on preference feedback control
  • Medium-term: Potential applications in building control, personalized recommendations
  • Long-term: May influence design paradigm for human-machine interaction systems

Limitations:

  • Strong assumptions may limit practical applications
  • Lack of real user experiments may affect credibility
  • Requires more engineering work for actual deployment

Applicable Scenarios

Ideal application scenarios:

  1. Building control:
    • Personalized temperature adjustment
    • Lighting control
    • Air quality management
    • Advantage: Relatively slow system dynamics, users can provide continuous feedback
  2. Personalized recommendations:
    • Product recommendations
    • Content recommendations
    • Advantage: Users accustomed to providing comparative feedback
  3. Healthcare:
    • Personalized treatment plan adjustment
    • Rehabilitation training intensity adjustment
    • Advantage: Emphasizes individual differences
  4. Human-machine collaboration:
    • Robot-assisted tasks
    • Autonomous vehicle personalization
    • Advantage: Requires real-time adaptation to user preferences

Unsuitable scenarios:

  1. Fast-dynamics systems: High-frequency trading, flight control (users cannot provide timely feedback)
  2. High-dimensional complex systems: Low exploration efficiency
  3. Strict safety constraints: Does not handle constraints, potentially unsafe
  4. Multi-objective conflicts: Only considers single utility
  5. Non-convex optimization: Theoretical guarantees fail

Improvement suggestions:

  • Combine with active learning to reduce user feedback frequency
  • Introduce safety filters to handle constraints
  • Extend to multi-objective scenarios
  • Develop adaptive parameter adjustment strategies

References

Key references:

  1. [8] Z. He et al., 2023 - Model-free nonlinear feedback optimization
    • Main theoretical foundation of this paper
    • Provides single-point residual estimation concept
  2. [18] Y. Yue & T. Joachims, 2009 - Interactively optimizing information retrieval
    • Classical work on preference feedback gradient estimation
    • This paper improves upon its two-evaluation requirement
  3. [16] W. Xu et al., 2024 - Principled preferential Bayesian optimization
    • Recent advances in preference Bayesian optimization
    • Provides comparison baseline for GP-based methods
  4. [27] Y. Lian et al., 2023 - Adaptive robust data-driven building control
    • Practical system model for building control
    • Provides realistic scenario for experiments
  5. [9] D. Kahneman & A. Tversky, 2013 - Prospect theory
    • Non-rational human decision-making behavior
    • Highlights limitations of user model assumptions

Overall Assessment: This is a strong paper with rigorous theory and clear novelty. It combines preference learning with closed-loop control and provides a new theoretical framework for designing human-machine interaction systems. Its main contributions are the first stability and optimality guarantees for real-time preference optimization and a practical single-point evaluation scheme. Its main weaknesses are the strong convexity assumption, the lack of experiments with real users, and the limited set of comparison methods. Future work should focus on relaxing the assumptions, conducting real user studies, and extending the method to more complex practical scenarios. The paper merits in-depth study by researchers working on human-machine interaction control, preference learning, or online optimization.