Optimization with preference feedback is an active research area with many applications in engineering systems where humans play a central role, such as building control and autonomous vehicles. While most existing studies focus on optimizing a static user utility, few have investigated its closed-loop behavior that accounts for system transients. In this work, we propose an online feedback optimization controller that can optimize user utility using pairwise comparison feedback with both optimality and closed-loop stability guarantees. By adding a random exploration signal, the controller estimates the gradient based on the binary utility comparison feedback between two consecutive time steps. We analyze its closed-loop behavior when interacting with a nonlinear plant and show that, under mild assumptions, the controller converges to the optimal point without inducing instability. Theoretical findings are further validated through numerical experiments.
This paper investigates optimization problems with preference feedback, which have broad applications in human-centric engineering systems such as building control and autonomous driving. Existing research primarily focuses on static user utility optimization with limited consideration of closed-loop transient behavior. The paper proposes an online feedback optimization controller that leverages pairwise comparison feedback to optimize user utility while providing optimality and closed-loop stability guarantees. By incorporating random exploration signals, the controller estimates gradients based on binary utility comparisons between consecutive time steps. The authors analyze the closed-loop behavior when the controller interacts with nonlinear systems and prove that under mild assumptions, the controller converges to the optimal point without inducing instability. Theoretical findings are validated through numerical experiments.
Human-machine interactive control: How to design human-aware controllers that optimize user utility in real-time, enabling systems to adapt based on user preferences
Real-time optimization with preference feedback: How to perform online optimization using binary preference comparisons rather than absolute utility values
Closed-loop stability guarantees: How to ensure the optimization process does not destabilize the system while considering transient behavior
Individual differences: Traditional controllers track predefined reference points derived from large-scale population models (e.g., indoor temperature in building control), which introduces bias and suboptimal performance because individual variations are not accounted for
Time-varying utility: Without real-time human feedback, controllers cannot respond to time-varying utility and lack robustness to external disturbances
Human cognitive characteristics: Humans are better at relative comparisons than at absolute assessments, so preference feedback naturally takes the form of pairwise comparisons
Inspired by the recently proposed model-free online feedback optimization (OFO) with single-point residual estimation [8], the authors aim to provide the first treatment of real-time preference optimization with closed-loop guarantees.
Novel OFO controller: Proposes the first online feedback optimization controller utilizing binary preference feedback to optimize user utility while ensuring closed-loop stability
Single-point evaluation scheme: Employs a random exploration scheme requiring only one utility evaluation per time step (rather than two), better suited for online implementation
Theoretical guarantees:
Proves closed-loop system stability (Lemma 1: bounded expected Lyapunov function)
System model: Consider exponentially stable systems
$$x_{k+1} = f(x_k, u_k)$$
where $x \in \mathbb{R}^{n_x}$ is the system state, $u \in \mathbb{R}^{n_u}$ is the control input, and there exists a unique steady-state input-state mapping $h: \mathbb{R}^{n_u} \to \mathbb{R}^{n_x}$.
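As a minimal sketch of this system model, assume a hypothetical linear plant (the matrices `A` and `B` and the helper `simulate` below are illustrative choices, not taken from the paper). For such a plant, stability means the spectral radius of `A` is below one, and the steady-state map has the closed form h(u) = (I - A)^{-1} B u. The code checks that the state driven by a constant input settles at h(u).

```python
import numpy as np

# Hypothetical toy plant x_{k+1} = A x_k + B u_k (illustrative values only).
A = np.array([[0.5, 0.1],
              [0.0, 0.3]])
B = np.array([[1.0],
              [0.5]])

def f(x, u):
    """One step of the plant dynamics."""
    return A @ x + B @ u

def h(u):
    """Steady-state map: solves x = A x + B u for x."""
    return np.linalg.solve(np.eye(2) - A, B @ u)

def simulate(x0, u, steps=200):
    """Apply a constant input u for a number of steps and return the final state."""
    x = x0
    for _ in range(steps):
        x = f(x, u)
    return x

spectral_radius = max(abs(np.linalg.eigvals(A)))
u = np.array([2.0])
x_final = simulate(np.array([10.0, -5.0]), u)
steady_state_gap = np.linalg.norm(x_final - h(u))
print(spectral_radius, steady_state_gap)
```

The paper treats general nonlinear plants; the linear case is used here only because h can be written down explicitly.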
Optimization objective: Optimize user utility at steady-state
$$\min_{x,u} \Phi(x,u) \quad \text{s.t.} \quad x = h(u)$$
equivalent to the unconstrained problem:
$$\min_{u} \tilde{\Phi}(u), \quad \text{where } \tilde{\Phi}(u) = \Phi(h(u), u)$$
Preference feedback model (Bradley-Terry model):
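$$\mathbb{P}\left(\mathbf{1}_{u_1 \succ u_2} = 1\right) = \sigma\left(\tilde{\Phi}(u_2) - \tilde{\Phi}(u_1)\right)$$
where $\sigma(t) = \frac{1}{1 + e^{-t}}$ is the sigmoid function.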
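The preference model can be illustrated with a short simulation. This is a sketch under stated assumptions: the quadratic cost `phi` and the helper names are hypothetical, and the utility is treated as a cost to be minimized, so the input with the lower value is the one more likely to be preferred.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def preference_probability(cost_u1, cost_u2):
    """Bradley-Terry probability that u1 is preferred over u2 (lower cost wins)."""
    return sigmoid(cost_u2 - cost_u1)

def sample_preference(cost_u1, cost_u2, rng):
    """Draw the binary indicator 1_{u1 > u2} from the Bradley-Terry model."""
    return int(rng.random() < preference_probability(cost_u1, cost_u2))

# Hypothetical steady-state cost, minimized at u = 1 (illustration only).
def phi(u):
    return float(np.sum((u - 1.0) ** 2))

rng = np.random.default_rng(0)
u1, u2 = np.array([1.0]), np.array([3.0])
p = preference_probability(phi(u1), phi(u2))  # sigmoid(4 - 0)
freq = np.mean([sample_preference(phi(u1), phi(u2), rng) for _ in range(20000)])
print(p, freq)
```

Two inputs with equal cost are preferred with probability 0.5 each, and the feedback becomes nearly deterministic as the cost gap grows.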
Key assumptions:
Input-state mapping h is Lipschitz continuous
Utility function Φ(x,u) is Lipschitz continuous in x
$\tilde{\Phi}(u)$ is differentiable, Lipschitz continuous, smooth, and strongly convex
Input: step size η, smoothing parameter δ, initial input u₀, time steps T
for k = 1, ..., T-1:
    1. Add random exploration: xₖ₊₁ = f(xₖ, uₖ + δvₖ),
       where vₖ is uniformly sampled from the (nᵤ-1)-dimensional unit sphere
    2. Collect preference feedback:
       Ask user to compare Φ(xₖ₊₁, uₖ + δvₖ) and Φ(xₖ, uₖ₋₁ + δvₖ₋₁)
       Sample 𝟙_{(xₖ₊₁,uₖ+δvₖ)≻(xₖ,uₖ₋₁+δvₖ₋₁)}
    3. Update control input:
       uₖ₊₁ = uₖ + (η/(2δ)) 𝟙_{(xₖ₊₁,uₖ+δvₖ)≻(xₖ,uₖ₋₁+δvₖ₋₁)} vₖ
end for
Output: u_T
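The loop above can be sketched in Python as follows. This is an illustrative simulation, not the authors' implementation. The assumptions are: a hypothetical stable plant with h(u) = u, a hypothetical quadratic cost, a simulated user who answers according to the Bradley-Terry model with lower cost preferred, a step factor read as η/(2δ), and a first comparison made against the cost at the initial state.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def sample_unit_sphere(dim, rng):
    """Uniform sample from the unit sphere in R^dim (normalized Gaussian)."""
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def run_preference_ofo(f, utility, u0, x0, eta, delta, T, rng):
    u = np.array(u0, dtype=float)
    x = np.array(x0, dtype=float)
    prev_cost = utility(x, u)  # assumption: baseline for the first comparison
    for _ in range(T):
        v = sample_unit_sphere(u.size, rng)
        u_applied = u + delta * v          # step 1: perturbed input
        x = f(x, u_applied)                # plant moves one step (transient included)
        cost = utility(x, u_applied)
        # Step 2: simulated user compares the new situation with the previous one.
        prefers_new = rng.random() < sigmoid(prev_cost - cost)
        # Step 3: move along v only when the new situation is preferred.
        if prefers_new:
            u = u + (eta / (2.0 * delta)) * v
        prev_cost = cost
    return u

# Hypothetical plant with steady-state map h(u) = u, and a quadratic cost.
target = np.array([1.0, -1.0])

def f(x, u):
    return 0.5 * x + 0.5 * u

def utility(x, u):
    return float(np.sum((x - target) ** 2))

rng = np.random.default_rng(0)
u_start = np.array([3.0, 3.0])
u_T = run_preference_ofo(f, utility, u_start, u_start, eta=0.02, delta=0.5, T=6000, rng=rng)
initial_cost = utility(u_start, u_start)   # steady-state cost at the initial input
final_cost = utility(u_T, u_T)             # steady-state cost at the final input
print(initial_cost, final_cost)
```

Because the binary feedback is noisy and the exploration amplitude δ is fixed, the iterate in this sketch settles in a neighborhood of the minimizer rather than converging to it exactly.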
Extend theoretical framework to alternative user models (e.g., noiseless models)
Practical applications: Product design, chemical selection, etc.
Relax assumptions: Non-convex utility functions, unstable systems
Multi-agent scenarios: Preference aggregation from multiple users
Potential research directions:
5. Adaptive parameter adjustment: Online tuning of η and δ
6. User fatigue modeling: Reduce feedback frequency
7. Delayed feedback: Handle user response delays
8. High-dimensional optimization: More efficient exploration strategies
[8] Z. He et al., 2023, "Model-free nonlinear feedback optimization"
Main theoretical foundation of this paper
Provides single-point residual estimation concept
[18] Y. Yue & T. Joachims, 2009, "Interactively optimizing information retrieval"
Classical work on preference feedback gradient estimation
This paper improves upon its two-evaluation requirement
[16] W. Xu et al., 2024, "Principled preferential Bayesian optimization"
Recent advances in preference Bayesian optimization
Provides comparison baseline for GP-based methods
[27] Y. Lian et al., 2023, "Adaptive robust data-driven building control"
Practical system model for building control
Provides realistic scenario for experiments
[9] D. Kahneman & A. Tversky, 2013, "Prospect theory"
Non-rational human decision-making behavior
Highlights limitations of user model assumptions
Overall Assessment: This is an excellent paper with rigorous theory and strong innovation. It combines preference learning with closed-loop control and provides a new theoretical framework for the design of human-machine interactive systems. Its main contributions are the first stability and optimality guarantees for real-time preference optimization and a practically useful single-point evaluation scheme. The main weaknesses are the strong convexity assumption, the absence of experiments with real users, and the limited comparative experiments. Future work should focus on relaxing the assumptions, conducting real user studies, and extending the method to more complex practical scenarios. The paper merits in-depth study by researchers working on human-machine interactive control, preference learning, or online optimization.