2025-11-15T02:07:10.757818

Nearly Minimax Optimal Regret for Multinomial Logistic Bandit

Lee, Oh
In this paper, we study the contextual multinomial logit (MNL) bandit problem in which a learning agent sequentially selects an assortment based on contextual information, and user feedback follows an MNL choice model. There has been a significant discrepancy between lower and upper regret bounds, particularly regarding the maximum assortment size $K$. Additionally, the variation in reward structures between these bounds complicates the quest for optimality. Under uniform rewards, where all items have the same expected reward, we establish a regret lower bound of $Ω(d\sqrt{T/K})$ and propose a constant-time algorithm, OFU-MNL+, that achieves a matching upper bound of $\tilde{O}(d\sqrt{T/K})$. We also provide instance-dependent minimax regret bounds under uniform rewards. Under non-uniform rewards, we prove a lower bound of $Ω(d\sqrt{T})$ and an upper bound of $\tilde{O}(d\sqrt{T})$, also achievable by OFU-MNL+. Our empirical studies support these theoretical findings. To the best of our knowledge, this is the first work in the contextual MNL bandit literature to prove minimax optimality -- for either uniform or non-uniform reward setting -- and to propose a computationally efficient algorithm that achieves this optimality up to logarithmic factors.
academic

Nearly Minimax Optimal Regret for Multinomial Logistic Bandit

Basic Information

  • Paper ID: 2405.09831
  • Title: Nearly Minimax Optimal Regret for Multinomial Logistic Bandit
  • Authors: Joongkyu Lee (Seoul National University), Min-hwan Oh (Seoul National University)
  • Classification: stat.ML cs.LG
  • Publication Time/Venue: NeurIPS 2024 (38th Conference on Neural Information Processing Systems)
  • Paper Link: https://arxiv.org/abs/2405.09831

Abstract

This paper investigates the contextual multinomial logistic (MNL) bandit problem, where a learning agent sequentially selects product assortments based on contextual information, and user feedback follows the MNL choice model. Existing work exhibits significant gaps between lower and upper bounds, particularly with respect to the maximum assortment size K. Under the uniform reward setting, the paper establishes a regret lower bound of Ω(d√T/K) and proposes a constant-time algorithm OFU-MNL+, achieving a matching upper bound of Õ(d√T/K). Under the non-uniform reward setting, it proves a lower bound of Ω(d√T) and an upper bound of Õ(d√T). This is the first work to establish minimax optimality in the contextual MNL bandit literature.

Research Background and Motivation

Problem Background

  1. MNL Bandit Problem: In applications such as recommendation systems and online retail, an agent must present a set of products to users, whose choice behavior follows the multinomial logistic (MNL) model
  2. Contextual Information: In each round, the agent can observe product features and potentially user contextual information
  3. Theoretical Gap: Existing work exhibits significant gaps between upper and lower bounds on regret, particularly regarding the dependence on assortment size K

Research Motivation

  1. Theoretical Completeness: Fill the gap in theoretical analysis of MNL bandits and establish tight regret bounds
  2. Algorithm Efficiency: Design computationally efficient algorithms that avoid the exponential time complexity of existing methods
  3. Practical Applications: Provide theoretical guarantees and efficient algorithms for practical applications such as recommendation systems

Limitations of Existing Methods

  1. Theoretical Gap: A gap of √K exists between the lower bound Ω(d√T/K) and upper bound Õ(d√T)
  2. Computational Complexity: Existing algorithms require enumerating all possible product assortments, leading to exponential time complexity
  3. Parameter Dependence: Poor dependence on problem-related constant κ, where 1/κ = O(K²)

Core Contributions

  1. Establishing Tight Regret Bounds:
    • Uniform reward setting: Lower bound Ω(√(v₀K/(v₀+K))d√T), upper bound Õ(√(v₀K/(v₀+K))d√T)
    • Non-uniform reward setting: Lower bound Ω(d√T), upper bound Õ(d√T)
  2. Proposing Efficient Algorithm OFU-MNL+:
    • Constant time complexity O(1), independent of round t
    • First algorithm in MNL bandits proving that regret decreases with increasing K
  3. Theoretical Innovation:
    • First explicit demonstration of the impact of external option attraction parameter v₀ on regret
    • Provides instance-dependent minimax regret bounds
  4. Technical Improvements:
    • Improved elliptical potential lemma eliminating dependence on K
    • Loss function analysis with constant self-concordance property

Methodology Details

Task Definition

Input:

  • At each round t, observe feature vectors x_ ∈ ℝᵈ of N products
  • Maximum assortment size K
  • External option attraction parameter v₀

Output:

  • Select product assortment S_t ⊆ {1,...,N}, |S_t| ≤ K
  • Observe user choice c_t ∈ S_t ∪ {0}, following the MNL model

Objective: Minimize cumulative regret Reg_T(w*) = Σ_^T R_t(S_t, w) - R_t(S_t, w*)

Model Architecture

MNL Choice Model

The probability that a user chooses product i ∈ S_t is:

p_t(i|S_t, w*) = exp(x_{ti}^T w*) / (v₀ + Σ_{j∈S_t} exp(x_{tj}^T w*))

The probability of the external option (choosing no product) is:

p_t(0|S_t, w*) = v₀ / (v₀ + Σ_{j∈S_t} exp(x_{tj}^T w*))

OFU-MNL+ Algorithm Core Components

  1. Online Parameter Estimation:
    w_{t+1} = argmin_{w∈W} ⟨∇ℓ_t(w_t), w⟩ + (1/2η)||w - w_t||²_{H̃_t}
    

    where H̃_t = H_t + ηG_t(w_t), and G_t(w) is the Hessian matrix of the MNL loss
  2. Confidence Set Construction:
    C_t(δ) = {w ∈ W : ||w_t - w||_{H_t} ≤ β_t(δ)}
    

    where β_t(δ) = O(√(d log t log K))
  3. Optimistic Utility Computation:
    α_{ti} = x_{ti}^T w_t + β_t(δ)||x_{ti}||_{H_t^{-1}}
    
  4. Assortment Selection:
    • Uniform reward: Select K products with highest α_
    • Non-uniform reward: Solve a polynomial-time combinatorial optimization problem

Technical Innovations

  1. Improved Self-Concordance Analysis: Proves that the MNL loss function has 3√2-self-concordance property, improving upon the previous √(6K) by a factor of √K
  2. K-Independent Elliptical Potential Lemma:
    Σ_{t=1}^T Σ_{i∈S_t} p_t(i|S_t,w_{t+1})p_t(0|S_t,w_{t+1})||x_{ti}||²_{H_t^{-1}} ≤ 2d log(1 + T/(dλ))
    
  3. Tight KL Divergence Bounds: Establishes tighter KL divergence upper bounds, improving upon results by Chen et al.

Experimental Setup

Datasets

  • Synthetic datasets with parameter w* ∈ ℝᵈ uniformly sampled from -1/√d, 1/√d
  • Context features x_ sampled from multivariate Gaussian distribution N(0_d, I_d) and truncated to -1/√d, 1/√d
  • Configuration: N=100, K∈{5,10,15}, d=5, T=3000

Evaluation Metrics

  • Cumulative regret: Σ_^T R_t(S_t, w) - R_t(S_t, w*)
  • Computational time per round

Baseline Methods

  • UCB-MNL: Confidence upper bound-based method
  • TS-MNL: Thompson sampling-based method

Implementation Details

  • Regularization parameter λ = 84√(2d)η
  • Step size η = (1/2)log(K+1) + 2
  • Confidence radius β_t(δ) = O(√(d log t log K))

Experimental Results

Main Results

  1. Regret Performance:
    • Under uniform rewards, cumulative regret of all algorithms decreases as K increases
    • Under non-uniform rewards, increasing K does not guarantee regret improvement
    • OFU-MNL+ significantly outperforms baseline methods in all settings
  2. Computational Efficiency:
    • OFU-MNL+ maintains constant computational cost, independent of round t
    • Baseline methods' computational time grows linearly with t
    • Runtime under uniform reward setting is approximately 1/10 of non-uniform setting

Theoretical Verification

Experimental results validate theoretical analysis:

  • When v₀ = Θ(1), regret decreases with K
  • When v₀ = Θ(K), regret is independent of K
  • Under non-uniform rewards, regret is independent of K

MNL Bandit Research

  1. Non-Contextual Setting: Agrawal et al. established Ω(√(NT/K)) lower bound
  2. Contextual Setting: Chen et al. proposed Ω(d√T/K) lower bound, but algorithm complexity is exponential
  3. Computational Efficiency: Oh and Iyengar proposed polynomial-time algorithm, but regret bound has 1/κ dependence

Logistic Bandits

  • Binary logistic bandits have optimal algorithms
  • MNL bandits are multi-choice extensions of logistic bandits

Combinatorial Bandits

  • Related to top-k combinatorial bandits but with different reward structures
  • In MNL model, individual product reward depends on the entire assortment

Conclusions and Discussion

Main Conclusions

  1. Theoretical Completeness: First achievement of minimax optimality in MNL bandits
  2. Algorithm Efficiency: Proposes the first optimal algorithm with constant time complexity
  3. Parameter Impact: Explicitly quantifies the impact of v₀ and K on regret

Limitations

  1. Bounded Assumption: Requires ||w*||₂ ≤ 1; relaxing this introduces additional exponential factors in regret bounds
  2. Instance-Dependent Bounds: Cannot establish instance-dependent bounds under non-uniform rewards
  3. Constant Factors: Logarithmic factors in theoretical bounds may be large in practice

Future Directions

  1. Relaxing Assumptions: Study optimal algorithms under unbounded parameter settings
  2. Instance Adaptation: Establish instance-dependent bounds for non-uniform rewards
  3. Practical Applications: Validate algorithm performance on real recommendation systems

In-Depth Evaluation

Strengths

  1. Significant Theoretical Contribution: First to resolve the optimality question for MNL bandits, filling an important theoretical gap
  2. Notable Technical Innovation: Improved self-concordance analysis and elliptical potential lemma have independent value
  3. High Practical Value: Constant time complexity makes the algorithm practically applicable
  4. Comprehensive Analysis: Considers both uniform and non-uniform rewards, providing a complete theoretical picture

Weaknesses

  1. Restrictive Assumptions: Bounded parameter assumption may be overly strict in practice
  2. Constant Optimization: Constant factors in theoretical bounds may not be sufficiently tight
  3. Limited Experiments: Experiments only on synthetic data, lacking validation on real datasets

Impact

  1. Academic Value: Establishes solid foundation for MNL bandit theory, expected to have high citation count
  2. Practical Value: Provides theoretical guidance for applications such as recommendation systems and online advertising
  3. Methodological Contribution: Technical methods are generalizable to related problems

Applicable Scenarios

  1. Recommendation Systems: Online product recommendation, content recommendation
  2. Online Advertising: Advertisement assortment selection and deployment optimization
  3. E-commerce: Product display and promotional strategy optimization

References

The paper cites 52 related references, primarily including:

  • Foundational MNL bandit work: Agrawal et al., Chen et al.
  • Logistic bandit theory: Faury et al., Abeille et al.
  • Online learning foundations: Lattimore & Szepesvári
  • Self-concordant function theory: Tran-Dinh et al.

Overall Assessment: This is a high-quality paper with outstanding theoretical contributions, successfully resolving a core open problem in MNL bandits with significant technical innovations and sufficient experimental validation, having important implications for related fields.