2025-11-15T02:07:10.757818

Nearly Minimax Optimal Regret for Multinomial Logistic Bandit

Lee, Oh

In this paper, we study the contextual multinomial logit (MNL) bandit problem in which a learning agent sequentially selects an assortment based on contextual information, and user feedback follows an MNL choice model. There has been a significant discrepancy between lower and upper regret bounds, particularly regarding the maximum assortment size $K$. Additionally, the variation in reward structures between these bounds complicates the quest for optimality. Under uniform rewards, where all items have the same expected reward, we establish a regret lower bound of $Î©(d\sqrt{T/K})$ and propose a constant-time algorithm, OFU-MNL+, that achieves a matching upper bound of $\tilde{O}(d\sqrt{T/K})$. We also provide instance-dependent minimax regret bounds under uniform rewards. Under non-uniform rewards, we prove a lower bound of $Î©(d\sqrt{T})$ and an upper bound of $\tilde{O}(d\sqrt{T})$, also achievable by OFU-MNL+. Our empirical studies support these theoretical findings. To the best of our knowledge, this is the first work in the contextual MNL bandit literature to prove minimax optimality -- for either uniform or non-uniform reward setting -- and to propose a computationally efficient algorithm that achieves this optimality up to logarithmic factors.

academic

Nearly Minimax Optimal Regret for Multinomial Logistic Bandit

Basic Information

Paper ID: 2405.09831
Title: Nearly Minimax Optimal Regret for Multinomial Logistic Bandit
Authors: Joongkyu Lee (Seoul National University), Min-hwan Oh (Seoul National University)
Classification: stat.ML cs.LG
Publication Time/Venue: NeurIPS 2024 (38th Conference on Neural Information Processing Systems)
Paper Link: https://arxiv.org/abs/2405.09831

Abstract

This paper investigates the contextual multinomial logistic (MNL) bandit problem, where a learning agent sequentially selects product assortments based on contextual information, and user feedback follows the MNL choice model. Existing work exhibits significant gaps between lower and upper bounds, particularly with respect to the maximum assortment size K. Under the uniform reward setting, the paper establishes a regret lower bound of Ω(d√T/K) and proposes a constant-time algorithm OFU-MNL+, achieving a matching upper bound of Õ(d√T/K). Under the non-uniform reward setting, it proves a lower bound of Ω(d√T) and an upper bound of Õ(d√T). This is the first work to establish minimax optimality in the contextual MNL bandit literature.

Research Background and Motivation

Problem Background

MNL Bandit Problem: In applications such as recommendation systems and online retail, an agent must present a set of products to users, whose choice behavior follows the multinomial logistic (MNL) model
Contextual Information: In each round, the agent can observe product features and potentially user contextual information
Theoretical Gap: Existing work exhibits significant gaps between upper and lower bounds on regret, particularly regarding the dependence on assortment size K

Research Motivation

Theoretical Completeness: Fill the gap in theoretical analysis of MNL bandits and establish tight regret bounds
Algorithm Efficiency: Design computationally efficient algorithms that avoid the exponential time complexity of existing methods
Practical Applications: Provide theoretical guarantees and efficient algorithms for practical applications such as recommendation systems

Limitations of Existing Methods

Theoretical Gap: A gap of √K exists between the lower bound Ω(d√T/K) and upper bound Õ(d√T)
Computational Complexity: Existing algorithms require enumerating all possible product assortments, leading to exponential time complexity
Parameter Dependence: Poor dependence on problem-related constant κ, where 1/κ = O(K²)

Core Contributions

Establishing Tight Regret Bounds:
- Uniform reward setting: Lower bound Ω(√(v₀K/(v₀+K))d√T), upper bound Õ(√(v₀K/(v₀+K))d√T)
- Non-uniform reward setting: Lower bound Ω(d√T), upper bound Õ(d√T)
Proposing Efficient Algorithm OFU-MNL+:
- Constant time complexity O(1), independent of round t
- First algorithm in MNL bandits proving that regret decreases with increasing K
Theoretical Innovation:
- First explicit demonstration of the impact of external option attraction parameter v₀ on regret
- Provides instance-dependent minimax regret bounds
Technical Improvements:
- Improved elliptical potential lemma eliminating dependence on K
- Loss function analysis with constant self-concordance property

Methodology Details

Task Definition

Input:

At each round t, observe feature vectors x_ ∈ ℝᵈ of N products
Maximum assortment size K
External option attraction parameter v₀

Output:

Select product assortment S_t ⊆ {1,...,N}, |S_t| ≤ K
Observe user choice c_t ∈ S_t ∪ {0}, following the MNL model

Objective: Minimize cumulative regret Reg_T(w*) = Σ_^T R_t(S_t, w) - R_t(S_t, w*)

Model Architecture

MNL Choice Model

The probability that a user chooses product i ∈ S_t is:

p_t(i|S_t, w*) = exp(x_{ti}^T w*) / (v₀ + Σ_{j∈S_t} exp(x_{tj}^T w*))

The probability of the external option (choosing no product) is:

p_t(0|S_t, w*) = v₀ / (v₀ + Σ_{j∈S_t} exp(x_{tj}^T w*))

OFU-MNL+ Algorithm Core Components

Online Parameter Estimation:
```
w_{t+1} = argmin_{w∈W} ⟨∇ℓ_t(w_t), w⟩ + (1/2η)||w - w_t||²_{H̃_t}
```
where H̃_t = H_t + ηG_t(w_t), and G_t(w) is the Hessian matrix of the MNL loss
Confidence Set Construction:
```
C_t(δ) = {w ∈ W : ||w_t - w||_{H_t} ≤ β_t(δ)}
```
where β_t(δ) = O(√(d log t log K))

Optimistic Utility Computation:

α_{ti} = x_{ti}^T w_t + β_t(δ)||x_{ti}||_{H_t^{-1}}

Assortment Selection:
- Uniform reward: Select K products with highest α_
- Non-uniform reward: Solve a polynomial-time combinatorial optimization problem

Technical Innovations

Improved Self-Concordance Analysis: Proves that the MNL loss function has 3√2-self-concordance property, improving upon the previous √(6K) by a factor of √K

K-Independent Elliptical Potential Lemma:

Σ_{t=1}^T Σ_{i∈S_t} p_t(i|S_t,w_{t+1})p_t(0|S_t,w_{t+1})||x_{ti}||²_{H_t^{-1}} ≤ 2d log(1 + T/(dλ))

Tight KL Divergence Bounds: Establishes tighter KL divergence upper bounds, improving upon results by Chen et al.

Experimental Setup

Datasets

Synthetic datasets with parameter w* ∈ ℝᵈ uniformly sampled from -1/√d, 1/√dᵈ
Context features x_ sampled from multivariate Gaussian distribution N(0_d, I_d) and truncated to -1/√d, 1/√dᵈ
Configuration: N=100, K∈{5,10,15}, d=5, T=3000

Evaluation Metrics

Cumulative regret: Σ_^T R_t(S_t, w) - R_t(S_t, w*)
Computational time per round

Baseline Methods

UCB-MNL: Confidence upper bound-based method
TS-MNL: Thompson sampling-based method

Implementation Details

Regularization parameter λ = 84√(2d)η
Step size η = (1/2)log(K+1) + 2
Confidence radius β_t(δ) = O(√(d log t log K))

Experimental Results

Main Results

Regret Performance:
- Under uniform rewards, cumulative regret of all algorithms decreases as K increases
- Under non-uniform rewards, increasing K does not guarantee regret improvement
- OFU-MNL+ significantly outperforms baseline methods in all settings
Computational Efficiency:
- OFU-MNL+ maintains constant computational cost, independent of round t
- Baseline methods' computational time grows linearly with t
- Runtime under uniform reward setting is approximately 1/10 of non-uniform setting

Theoretical Verification

Experimental results validate theoretical analysis:

When v₀ = Θ(1), regret decreases with K
When v₀ = Θ(K), regret is independent of K
Under non-uniform rewards, regret is independent of K

MNL Bandit Research

Non-Contextual Setting: Agrawal et al. established Ω(√(NT/K)) lower bound
Contextual Setting: Chen et al. proposed Ω(d√T/K) lower bound, but algorithm complexity is exponential
Computational Efficiency: Oh and Iyengar proposed polynomial-time algorithm, but regret bound has 1/κ dependence

Logistic Bandits

Binary logistic bandits have optimal algorithms
MNL bandits are multi-choice extensions of logistic bandits

Combinatorial Bandits

Related to top-k combinatorial bandits but with different reward structures
In MNL model, individual product reward depends on the entire assortment

Conclusions and Discussion

Main Conclusions

Theoretical Completeness: First achievement of minimax optimality in MNL bandits
Algorithm Efficiency: Proposes the first optimal algorithm with constant time complexity
Parameter Impact: Explicitly quantifies the impact of v₀ and K on regret

Limitations

Bounded Assumption: Requires ||w*||₂ ≤ 1; relaxing this introduces additional exponential factors in regret bounds
Instance-Dependent Bounds: Cannot establish instance-dependent bounds under non-uniform rewards
Constant Factors: Logarithmic factors in theoretical bounds may be large in practice

Future Directions

Relaxing Assumptions: Study optimal algorithms under unbounded parameter settings
Instance Adaptation: Establish instance-dependent bounds for non-uniform rewards
Practical Applications: Validate algorithm performance on real recommendation systems

In-Depth Evaluation

Strengths

Significant Theoretical Contribution: First to resolve the optimality question for MNL bandits, filling an important theoretical gap
Notable Technical Innovation: Improved self-concordance analysis and elliptical potential lemma have independent value
High Practical Value: Constant time complexity makes the algorithm practically applicable
Comprehensive Analysis: Considers both uniform and non-uniform rewards, providing a complete theoretical picture

Weaknesses

Restrictive Assumptions: Bounded parameter assumption may be overly strict in practice
Constant Optimization: Constant factors in theoretical bounds may not be sufficiently tight
Limited Experiments: Experiments only on synthetic data, lacking validation on real datasets

Impact

Academic Value: Establishes solid foundation for MNL bandit theory, expected to have high citation count
Practical Value: Provides theoretical guidance for applications such as recommendation systems and online advertising
Methodological Contribution: Technical methods are generalizable to related problems

Applicable Scenarios

Recommendation Systems: Online product recommendation, content recommendation
Online Advertising: Advertisement assortment selection and deployment optimization
E-commerce: Product display and promotional strategy optimization

References

The paper cites 52 related references, primarily including:

Foundational MNL bandit work: Agrawal et al., Chen et al.
Logistic bandit theory: Faury et al., Abeille et al.
Online learning foundations: Lattimore & Szepesvári
Self-concordant function theory: Tran-Dinh et al.

Overall Assessment: This is a high-quality paper with outstanding theoretical contributions, successfully resolving a core open problem in MNL bandits with significant technical innovations and sufficient experimental validation, having important implications for related fields.