In this paper, we study the contextual multinomial logit (MNL) bandit problem in which a learning agent sequentially selects an assortment based on contextual information, and user feedback follows an MNL choice model. There has been a significant discrepancy between lower and upper regret bounds, particularly regarding the maximum assortment size $K$. Additionally, the variation in reward structures between these bounds complicates the quest for optimality. Under uniform rewards, where all items have the same expected reward, we establish a regret lower bound of $Ω(d\sqrt{T/K})$ and propose a constant-time algorithm, OFU-MNL+, that achieves a matching upper bound of $\tilde{O}(d\sqrt{T/K})$. We also provide instance-dependent minimax regret bounds under uniform rewards. Under non-uniform rewards, we prove a lower bound of $Ω(d\sqrt{T})$ and an upper bound of $\tilde{O}(d\sqrt{T})$, also achievable by OFU-MNL+. Our empirical studies support these theoretical findings. To the best of our knowledge, this is the first work in the contextual MNL bandit literature to prove minimax optimality -- for either uniform or non-uniform reward setting -- and to propose a computationally efficient algorithm that achieves this optimality up to logarithmic factors.
Nearly Minimax Optimal Regret for Multinomial Logistic Bandit Paper ID : 2405.09831Title : Nearly Minimax Optimal Regret for Multinomial Logistic BanditAuthors : Joongkyu Lee (Seoul National University), Min-hwan Oh (Seoul National University)Classification : stat.ML cs.LGPublication Time/Venue : NeurIPS 2024 (38th Conference on Neural Information Processing Systems)Paper Link : https://arxiv.org/abs/2405.09831 This paper investigates the contextual multinomial logistic (MNL) bandit problem, where a learning agent sequentially selects product assortments based on contextual information, and user feedback follows the MNL choice model. Existing work exhibits significant gaps between lower and upper bounds, particularly with respect to the maximum assortment size K. Under the uniform reward setting, the paper establishes a regret lower bound of Ω(d√T/K) and proposes a constant-time algorithm OFU-MNL+, achieving a matching upper bound of Õ(d√T/K). Under the non-uniform reward setting, it proves a lower bound of Ω(d√T) and an upper bound of Õ(d√T). This is the first work to establish minimax optimality in the contextual MNL bandit literature.
MNL Bandit Problem : In applications such as recommendation systems and online retail, an agent must present a set of products to users, whose choice behavior follows the multinomial logistic (MNL) modelContextual Information : In each round, the agent can observe product features and potentially user contextual informationTheoretical Gap : Existing work exhibits significant gaps between upper and lower bounds on regret, particularly regarding the dependence on assortment size KTheoretical Completeness : Fill the gap in theoretical analysis of MNL bandits and establish tight regret boundsAlgorithm Efficiency : Design computationally efficient algorithms that avoid the exponential time complexity of existing methodsPractical Applications : Provide theoretical guarantees and efficient algorithms for practical applications such as recommendation systemsTheoretical Gap : A gap of √K exists between the lower bound Ω(d√T/K) and upper bound Õ(d√T)Computational Complexity : Existing algorithms require enumerating all possible product assortments, leading to exponential time complexityParameter Dependence : Poor dependence on problem-related constant κ, where 1/κ = O(K²)Establishing Tight Regret Bounds :Uniform reward setting: Lower bound Ω(√(v₀K/(v₀+K))d√T), upper bound Õ(√(v₀K/(v₀+K))d√T) Non-uniform reward setting: Lower bound Ω(d√T), upper bound Õ(d√T) Proposing Efficient Algorithm OFU-MNL+ :Constant time complexity O(1), independent of round t First algorithm in MNL bandits proving that regret decreases with increasing K Theoretical Innovation :First explicit demonstration of the impact of external option attraction parameter v₀ on regret Provides instance-dependent minimax regret bounds Technical Improvements :Improved elliptical potential lemma eliminating dependence on K Loss function analysis with constant self-concordance property Input :
At each round t, observe feature vectors x_ ∈ ℝᵈ of N products Maximum assortment size K External option attraction parameter v₀ Output :
Select product assortment S_t ⊆ {1,...,N}, |S_t| ≤ K Observe user choice c_t ∈ S_t ∪ {0}, following the MNL model Objective : Minimize cumulative regret Reg_T(w*) = Σ_^T R_t(S_t, w ) - R_t(S_t, w*)
The probability that a user chooses product i ∈ S_t is:
p_t(i|S_t, w*) = exp(x_{ti}^T w*) / (v₀ + Σ_{j∈S_t} exp(x_{tj}^T w*))
The probability of the external option (choosing no product) is:
p_t(0|S_t, w*) = v₀ / (v₀ + Σ_{j∈S_t} exp(x_{tj}^T w*))
Online Parameter Estimation :w_{t+1} = argmin_{w∈W} ⟨∇ℓ_t(w_t), w⟩ + (1/2η)||w - w_t||²_{H̃_t}
where H̃_t = H_t + ηG_t(w_t), and G_t(w) is the Hessian matrix of the MNL lossConfidence Set Construction :C_t(δ) = {w ∈ W : ||w_t - w||_{H_t} ≤ β_t(δ)}
where β_t(δ) = O(√(d log t log K))Optimistic Utility Computation :α_{ti} = x_{ti}^T w_t + β_t(δ)||x_{ti}||_{H_t^{-1}}
Assortment Selection :Uniform reward: Select K products with highest α_ Non-uniform reward: Solve a polynomial-time combinatorial optimization problem Improved Self-Concordance Analysis : Proves that the MNL loss function has 3√2-self-concordance property, improving upon the previous √(6K) by a factor of √KK-Independent Elliptical Potential Lemma :Σ_{t=1}^T Σ_{i∈S_t} p_t(i|S_t,w_{t+1})p_t(0|S_t,w_{t+1})||x_{ti}||²_{H_t^{-1}} ≤ 2d log(1 + T/(dλ))
Tight KL Divergence Bounds : Establishes tighter KL divergence upper bounds, improving upon results by Chen et al.Synthetic datasets with parameter w* ∈ ℝᵈ uniformly sampled from -1/√d, 1/√d ᵈ Context features x_ sampled from multivariate Gaussian distribution N(0_d, I_d) and truncated to -1/√d, 1/√d ᵈ Configuration: N=100, K∈{5,10,15}, d=5, T=3000 Cumulative regret: Σ_^T R_t(S_t, w ) - R_t(S_t, w*) Computational time per round UCB-MNL: Confidence upper bound-based method TS-MNL: Thompson sampling-based method Regularization parameter λ = 84√(2d)η Step size η = (1/2)log(K+1) + 2 Confidence radius β_t(δ) = O(√(d log t log K)) Regret Performance :Under uniform rewards, cumulative regret of all algorithms decreases as K increases Under non-uniform rewards, increasing K does not guarantee regret improvement OFU-MNL+ significantly outperforms baseline methods in all settings Computational Efficiency :OFU-MNL+ maintains constant computational cost, independent of round t Baseline methods' computational time grows linearly with t Runtime under uniform reward setting is approximately 1/10 of non-uniform setting Experimental results validate theoretical analysis:
When v₀ = Θ(1), regret decreases with K When v₀ = Θ(K), regret is independent of K Under non-uniform rewards, regret is independent of K Non-Contextual Setting : Agrawal et al. established Ω(√(NT/K)) lower boundContextual Setting : Chen et al. proposed Ω(d√T/K) lower bound, but algorithm complexity is exponentialComputational Efficiency : Oh and Iyengar proposed polynomial-time algorithm, but regret bound has 1/κ dependenceBinary logistic bandits have optimal algorithms MNL bandits are multi-choice extensions of logistic bandits Related to top-k combinatorial bandits but with different reward structures In MNL model, individual product reward depends on the entire assortment Theoretical Completeness : First achievement of minimax optimality in MNL banditsAlgorithm Efficiency : Proposes the first optimal algorithm with constant time complexityParameter Impact : Explicitly quantifies the impact of v₀ and K on regretBounded Assumption : Requires ||w*||₂ ≤ 1; relaxing this introduces additional exponential factors in regret boundsInstance-Dependent Bounds : Cannot establish instance-dependent bounds under non-uniform rewardsConstant Factors : Logarithmic factors in theoretical bounds may be large in practiceRelaxing Assumptions : Study optimal algorithms under unbounded parameter settingsInstance Adaptation : Establish instance-dependent bounds for non-uniform rewardsPractical Applications : Validate algorithm performance on real recommendation systemsSignificant Theoretical Contribution : First to resolve the optimality question for MNL bandits, filling an important theoretical gapNotable Technical Innovation : Improved self-concordance analysis and elliptical potential lemma have independent valueHigh Practical Value : Constant time complexity makes the algorithm practically applicableComprehensive Analysis : Considers both uniform and non-uniform rewards, providing a complete theoretical pictureRestrictive Assumptions : Bounded parameter assumption may be overly strict in practiceConstant Optimization : Constant factors in theoretical bounds may not be sufficiently tightLimited Experiments : Experiments only on synthetic data, lacking validation on real datasetsAcademic Value : Establishes solid foundation for MNL bandit theory, expected to have high citation countPractical Value : Provides theoretical guidance for applications such as recommendation systems and online advertisingMethodological Contribution : Technical methods are generalizable to related problemsRecommendation Systems : Online product recommendation, content recommendationOnline Advertising : Advertisement assortment selection and deployment optimizationE-commerce : Product display and promotional strategy optimizationThe paper cites 52 related references, primarily including:
Foundational MNL bandit work: Agrawal et al., Chen et al. Logistic bandit theory: Faury et al., Abeille et al. Online learning foundations: Lattimore & Szepesvári Self-concordant function theory: Tran-Dinh et al. Overall Assessment : This is a high-quality paper with outstanding theoretical contributions, successfully resolving a core open problem in MNL bandits with significant technical innovations and sufficient experimental validation, having important implications for related fields.