
UCB-type Algorithm for Budget-Constrained Expert Learning

Latypov, Suvorikova, Kroshnin et al.

In many modern applications, a system must dynamically choose between several adaptive learning algorithms that are trained online. Examples include model selection in streaming environments, switching between trading strategies in finance, and orchestrating multiple contextual bandit or reinforcement learning agents. At each round, a learner must select one predictor among $K$ adaptive experts to make a prediction, while being able to update at most $M \le K$ of them under a fixed training budget. We address this problem in the \emph{stochastic setting} and introduce \algname{M-LCB}, a computationally efficient UCB-style meta-algorithm that provides \emph{anytime regret guarantees}. Its confidence intervals are built directly from realized losses, require no additional optimization, and seamlessly reflect the convergence properties of the underlying experts. If each expert achieves internal regret $\tilde O(T^{\alpha})$, then \algname{M-LCB} ensures overall regret bounded by $\tilde O\!\Bigl(\sqrt{\tfrac{KT}{M}} \;+\; (K/M)^{1-\alpha}\,T^{\alpha}\Bigr)$. To our knowledge, this is the first result establishing regret guarantees when multiple adaptive experts are trained simultaneously under per-round budget constraints. We illustrate the framework with two representative cases: (i) parametric models trained online with stochastic losses, and (ii) experts that are themselves multi-armed bandit algorithms. These examples highlight how \algname{M-LCB} extends the classical bandit paradigm to the more realistic scenario of coordinating stateful, self-learning experts under limited resources.



Basic information

Paper ID: 2510.22654
Title: UCB-type Algorithm for Budget-Constrained Expert Learning
Authors: Ilgam Latypov, Alexandra Suvorikova, Alexey Kroshnin, Alexander Gasnikov, Yuriy Dorn
Categories: cs.LG (Machine Learning), cs.MA (Multiagent Systems)
Published: October 28, 2025 (Preprint)
Paper link: https://arxiv.org/abs/2510.22654

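Recommender systems: run several predictors in parallel and update them from user feedback
Financial platforms: switch between trading strategies as market regimes evolve
Large-scale online services: manage a portfolio of contextual bandit or reinforcement learning algorithms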

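Core challenges

Limitations of traditional approaches:

Classical multi-armed bandits (MAB): assume static or adversarial reward distributions and do not account for the arms' ability to learn
Expert algorithms: typically require full feedback and do not account for the experts' learning rates
Existing methods: none adequately addresses the challenge of managing multiple simultaneously learning experts under a per-round training budget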

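Research motivation

This paper aims to close this gap by proposing a procedure that unifies prediction and selective training under a fixed per-round computational budget.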

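Core contributions

Novel UCB-type meta-algorithm (M-LCB): a new algorithm for managing a pool of K self-learning experts under a limited per-round training budget M (M≤K)
Computational efficiency: confidence bounds are built directly from realized losses, which is cheap to compute and avoids expensive auxiliary optimization
Theoretical analysis: the meta-algorithm's performance is bounded in terms of the experts' individual convergence rates; when each expert has regret Õ(n^α), the overall regret is Õ(√(KT/M) + (K/M)^(1-α)T^α)
Multiple-play bandit extension: shows that M-LCB extends to the multiple-play bandit setting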
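
To get a feel for how the two terms of the stated regret bound trade off, the following minimal sketch evaluates them numerically. It drops all constants and logarithmic factors hidden by the Õ notation, so the numbers indicate scaling only. The function name `regret_bound_terms` and its parameter checks are illustrative assumptions, not something taken from the paper.

```python
import math
from typing import Tuple


def regret_bound_terms(K: int, M: int, T: int, alpha: float) -> Tuple[float, float]:
    """Evaluate the two terms of O~(sqrt(KT/M) + (K/M)^(1-alpha) * T^alpha).

    Constants and log factors are ignored (assumption: scaling only).
    Returns (selection_term, expert_learning_term).
    """
    if not 1 <= M <= K:
        raise ValueError("budget must satisfy 1 <= M <= K")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha is expected to lie in [0, 1]")
    if T < 1:
        raise ValueError("horizon T must be positive")
    selection_term = math.sqrt(K * T / M)
    expert_learning_term = (K / M) ** (1.0 - alpha) * T ** alpha
    return selection_term, expert_learning_term


if __name__ == "__main__":
    # Compare a tight budget (M=1) with a full budget (M=K) for alpha = 1/2.
    for M in (1, 4, 16):
        sel, learn = regret_bound_terms(K=16, M=M, T=10_000, alpha=0.5)
        print(f"M={M:2d}  sqrt(KT/M)={sel:8.1f}  (K/M)^(1-a) T^a={learn:8.1f}")
```

With M = K both terms reduce to √T and T^α respectively, which matches the intuition that a full budget removes the cost of sharing training rounds across experts.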

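Method details

Task definition

Decision space U: the space of expert advice
Environment space E: the space of random outcomes
Loss function: ℓ : U×E → R₊
Expert specification: each expert k is specified by the tuple (Wₖ,Hₖ,Aₖ,gₖ,υₖ)
- Wₖ: state/parameter space
- Hₖ: history space
- Aₖ: online learning algorithm
- gₖ: mapping from state to advice
- υₖ: safe-advice generator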
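
The expert tuple above can be pictured as a small stateful object. The sketch below is one possible reading, under several assumptions that the summary does not confirm: the call signatures of Aₖ, gₖ and υₖ, the choice to fall back on the safe advice whenever the expert has not yet been trained, and every name (`Expert`, `predict`, `train`, `make_running_mean_expert`) are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional


@dataclass
class Expert:
    """Illustrative container for the tuple (W_k, H_k, A_k, g_k, v_k)."""
    update: Callable[[Optional[Any], List[Any]], Any]  # A_k: (state, history) -> new state (assumed signature)
    advise: Callable[[Any], float]                     # g_k: state -> advice in U
    safe_advice: Callable[[], float]                   # v_k: fallback advice (assumed to take no arguments)
    state: Optional[Any] = None                        # element of W_k; None means "untrained" (assumption)
    history: List[Any] = field(default_factory=list)   # element of H_k

    def predict(self) -> float:
        # Assumption: an untrained expert falls back on its safe advice.
        if self.state is None:
            return self.safe_advice()
        return self.advise(self.state)

    def train(self, outcome: Any) -> None:
        # Record the observed outcome, then run one step of the online learner.
        self.history.append(outcome)
        self.state = self.update(self.state, self.history)


def make_running_mean_expert() -> Expert:
    """Toy expert whose state is the running mean of observed outcomes."""
    return Expert(
        update=lambda state, history: sum(history) / len(history),
        advise=lambda state: float(state),
        safe_advice=lambda: 0.0,
    )


if __name__ == "__main__":
    expert = make_running_mean_expert()
    print(expert.predict())   # safe advice before any training
    for outcome in (2.0, 4.0):
        expert.train(outcome)
    print(expert.predict())   # running mean of the observed outcomes
```

Under a per-round budget, a meta-algorithm would call `predict` on the one selected expert and `train` on at most M experts; how M-LCB makes those choices is not specified in this section, so it is left out of the sketch.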