
UCB-type Algorithm for Budget-Constrained Expert Learning

Latypov, Suvorikova, Kroshnin et al.

In many modern applications, a system must dynamically choose between several adaptive learning algorithms that are trained online. Examples include model selection in streaming environments, switching between trading strategies in finance, and orchestrating multiple contextual bandit or reinforcement learning agents. At each round, a learner must select one predictor among $K$ adaptive experts to make a prediction, while being able to update at most $M \le K$ of them under a fixed training budget. We address this problem in the \emph{stochastic setting} and introduce \algname{M-LCB}, a computationally efficient UCB-style meta-algorithm that provides \emph{anytime regret guarantees}. Its confidence intervals are built directly from realized losses, require no additional optimization, and seamlessly reflect the convergence properties of the underlying experts. If each expert achieves internal regret $\tilde O(T^\alpha)$, then \algname{M-LCB} ensures overall regret bounded by $\tilde O\!\Bigl(\sqrt{\tfrac{KT}{M}} \;+\; (K/M)^{1-\alpha}\,T^\alpha\Bigr)$. To our knowledge, this is the first result establishing regret guarantees when multiple adaptive experts are trained simultaneously under per-round budget constraints. We illustrate the framework with two representative cases: (i) parametric models trained online with stochastic losses, and (ii) experts that are themselves multi-armed bandit algorithms. These examples highlight how \algname{M-LCB} extends the classical bandit paradigm to the more realistic scenario of coordinating stateful, self-learning experts under limited resources.


UCB-type Algorithm for Budget-Constrained Expert Learning

Basic Information

Paper ID: 2510.22654
Title: UCB-type Algorithm for Budget-Constrained Expert Learning
Authors: Ilgam Latypov, Alexandra Suvorikova, Alexey Kroshnin, Alexander Gasnikov, Yuriy Dorn
Categories: cs.LG (Machine Learning), cs.MA (Multiagent Systems)
Published: October 28, 2025 (preprint)
Paper link: https://arxiv.org/abs/2510.22654

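Abstract

In many modern applications, a system must dynamically choose among several adaptive learning algorithms that are trained online. Examples include model selection in streaming environments, switching between trading strategies in finance, and coordinating multiple contextual bandit or reinforcement learning agents. In each round, the learner selects one predictor from K adaptive experts to make a prediction, while a fixed training budget allows at most M ≤ K experts to be updated.

The paper addresses this problem in the stochastic setting and proposes M-LCB, a computationally efficient UCB-type meta-algorithm with anytime regret guarantees. Its confidence intervals are built directly from realized losses, require no additional optimization, and reflect the convergence properties of the underlying experts. If each expert achieves internal regret Õ(T^α), M-LCB guarantees an overall regret bound of Õ(√(KT/M) + (K/M)^(1-α)T^α).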
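To get a feel for how the two terms of this bound trade off, the following sketch evaluates its shape numerically. It drops the constants and logarithmic factors hidden in Õ(·), so the numbers are only indicative. The function name `regret_bound_shape` and the sample values of K, M, T, and α are illustrative assumptions, not taken from the paper.

```python
import math


def regret_bound_shape(K: int, M: int, T: int, alpha: float) -> tuple[float, float]:
    """Return the two terms of sqrt(K*T/M) + (K/M)**(1-alpha) * T**alpha.

    Constants and log factors hidden by the O-tilde notation are ignored,
    so only the relative growth of the terms is meaningful.
    """
    if not (1 <= M <= K):
        raise ValueError("budget must satisfy 1 <= M <= K")
    first_term = math.sqrt(K * T / M)                   # sqrt(KT/M)
    second_term = (K / M) ** (1 - alpha) * T ** alpha   # (K/M)^(1-alpha) * T^alpha
    return first_term, second_term


if __name__ == "__main__":
    # Hypothetical values: 10 experts, horizon 10,000, alpha = 1/2.
    for M in (1, 5, 10):
        first, second = regret_bound_shape(K=10, M=M, T=10_000, alpha=0.5)
        print(f"M={M:2d}  first={first:8.1f}  second={second:8.1f}")
```

Two special cases follow directly from the formula: with α = 1/2 the two terms coincide, so the bound is of order √(KT/M), and with M = K it reduces to √T + T^α.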

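Research Background and Motivation

Problem Definition

Many real-world applications require dynamic selection among multiple self-learning experts:

Recommender systems: run multiple predictors in parallel and update them based on user feedback
Financial platforms: switch between trading strategies as market conditions evolve
Large-scale online services: manage ensembles of contextual bandit or reinforcement learning algorithms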

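Core Challenges

Limitations of existing approaches:

Classical multi-armed bandits (MAB): assume stationary or adversarial reward distributions and ignore the arms' ability to learn
Expert algorithms: typically require full feedback and do not account for how fast the experts learn
Existing methods: do not adequately address the problem of managing multiple simultaneously learning experts under a per-round training budget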

Research Motivation

The paper aims to fill this gap by proposing a procedure that unifies prediction and selective training under a fixed per-round computational budget.

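Core Contributions

A new UCB-type meta-algorithm (M-LCB): manages a pool of K self-learning experts under a limited per-round training budget M (M ≤ K)
Computational efficiency: builds confidence bounds directly from realized losses, avoiding expensive auxiliary optimization
Theoretical analysis: analyzes the meta-algorithm's performance in terms of each expert's individual convergence rate; if each expert's regret after n rounds is Õ(n^α), the overall regret is Õ(√(KT/M) + (K/M)^(1-α)T^α)
Multiple-play bandit extension: shows that M-LCB extends to the multiple-play bandit setting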
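The general idea of an index-based meta-loop with a per-round training budget can be sketched as follows. This is not the paper's M-LCB: the confidence radius is a generic Hoeffding-style term, the rule "train the M experts with the smallest index" is an assumption, and the toy expert model (expected loss `limit + 1/sqrt(n + 1)` after n updates) and all names are hypothetical. The sketch only illustrates how lower confidence bounds computed from realized losses can drive both prediction and selective training.

```python
import math
import random


def lcb_index(mean_loss: float, n_updates: int, t: int, c: float = 1.0) -> float:
    """Lower confidence bound on an expert's loss; untrained experts rank first."""
    if n_updates == 0:
        return -math.inf
    return mean_loss - c * math.sqrt(math.log(t + 1) / n_updates)


def run_meta_loop(limit_losses: list[float], K: int, M: int, T: int, seed: int = 0):
    """Toy loop: predict with one expert per round, train at most M of them.

    limit_losses[k] is the hypothetical loss that expert k converges to.
    """
    rng = random.Random(seed)
    sums = [0.0] * K    # cumulative realized loss per expert
    counts = [0] * K    # number of training updates per expert
    plays = [0] * K     # how often each expert was used for prediction
    for t in range(1, T + 1):
        means = [sums[k] / counts[k] if counts[k] else 0.0 for k in range(K)]
        order = sorted(range(K), key=lambda k: lcb_index(means[k], counts[k], t))
        plays[order[0]] += 1          # predict with the smallest-index expert
        for k in order[:M]:           # spend the training budget on M experts
            expected = limit_losses[k] + 1.0 / math.sqrt(counts[k] + 1)
            loss = expected + rng.gauss(0.0, 0.1)  # noisy realized loss
            sums[k] += loss
            counts[k] += 1
    return plays, counts


if __name__ == "__main__":
    plays, counts = run_meta_loop([0.1, 0.5, 0.9, 0.9], K=4, M=2, T=2000)
    print("plays:", plays)
    print("training updates:", counts)
```

In this toy setup exactly M experts are trained per round, so the update counts sum to M·T, while the play counts sum to T.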

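Method Details

Task Definition

Decision space U: the space of expert proposals
Environment space E: the space of stochastic outcomes
Loss function: ℓ : U×E → R₊
Expert specification: each expert k is specified by a tuple (Wₖ,Hₖ,Aₖ,gₖ,υₖ)
- Wₖ: state/parameter space
- Hₖ: history space
- Aₖ: online learning algorithm
- gₖ: map from states to proposals
- υₖ: safe proposal generator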
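One way to picture the expert tuple is as a small container type. The sketch below is a minimal illustration under stated assumptions: the class and field names are hypothetical, the history is modeled as a plain list of observations, and using the safe proposal generator only before the first update is an assumed convention, not something the summary specifies.

```python
from dataclasses import dataclass, field
from typing import Callable, Generic, List, TypeVar

W = TypeVar("W")  # expert state / parameters (element of W_k)
O = TypeVar("O")  # one observation stored in the history (H_k)
U = TypeVar("U")  # proposal (element of the decision space U)


@dataclass
class Expert(Generic[W, O, U]):
    state: W                                   # current state in W_k
    update: Callable[[W, List[O]], W]          # A_k: online learning step
    propose: Callable[[W], U]                  # g_k: state -> proposal
    safe_proposal: Callable[[], U]             # v_k: fallback proposal
    history: List[O] = field(default_factory=list)
    trained: bool = False

    def act(self) -> U:
        """Return the learned proposal once trained, else the safe fallback."""
        return self.propose(self.state) if self.trained else self.safe_proposal()

    def train(self, observation: O) -> None:
        """Record one observation and run one step of the learning algorithm."""
        self.history.append(observation)
        self.state = self.update(self.state, self.history)
        self.trained = True


# Hypothetical expert: predicts the running mean of the observations seen so far.
mean_expert = Expert(
    state=0.0,
    update=lambda w, h: sum(h) / len(h),
    propose=lambda w: w,
    safe_proposal=lambda: 0.5,
)
```

Calling `mean_expert.act()` before any training returns the safe proposal; after `train(...)` calls it returns the running mean.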