2025-11-24T01:22:17.846878

Revisiting Social Welfare in Bandits: UCB is (Nearly) All You Need

Sarkar, Pandey, Chowdhury
Regret in stochastic multi-armed bandits traditionally measures the difference between the highest reward and either the arithmetic mean of accumulated rewards or the final reward. These conventional metrics often fail to address fairness among agents receiving rewards, particularly in settings where rewards are distributed across a population, such as patients in clinical trials. To address this, a recent body of work has introduced Nash regret, which evaluates performance via the geometric mean of accumulated rewards, aligning with the Nash social welfare function known for satisfying fairness axioms. To minimize Nash regret, existing approaches require specialized algorithm designs and strong assumptions, such as multiplicative concentration inequalities and bounded, non-negative rewards, making them unsuitable for even Gaussian reward distributions. We demonstrate that an initial uniform exploration phase followed by a standard Upper Confidence Bound (UCB) algorithm achieves near-optimal Nash regret, while relying only on additive Hoeffding bounds, and naturally extending to sub-Gaussian rewards. Furthermore, we generalize the algorithm to a broad class of fairness metrics called the $p$-mean regret, proving (nearly) optimal regret bounds uniformly across all $p$ values. This is in contrast to prior work, which made extremely restrictive assumptions on the bandit instances and even then achieved suboptimal regret bounds.
academic

Revisiting Social Welfare in Bandits: UCB is (Nearly) All You Need

Basic Information

Abstract

This paper investigates fairness issues in the Multi-Armed Bandits (MAB) problem. Traditional regret metrics based on arithmetic means cannot guarantee fairness across users. To address this, the community introduced Nash regret metrics based on geometric means. However, existing methods for minimizing Nash regret require specialized algorithm design and strong assumptions (such as multiplicative concentration inequalities, bounded non-negative rewards), and even fail to apply to Gaussian distributions. This paper demonstrates that through data-adaptive uniform exploration followed by standard UCB algorithms, one can achieve near-optimal Nash regret while relying only on additive Hoeffding bounds, naturally extending to sub-Gaussian rewards. Furthermore, the paper generalizes the algorithm to p-mean regret, a broad class of fairness metrics, proving (near-)optimal regret bounds across all p values without the stringent assumptions of prior work.

Research Background and Motivation

Problem Definition

  1. Core Problem: In multi-armed bandit problems, how can one maximize cumulative rewards while ensuring fairness across time steps (corresponding to different users/patients)? Traditional average regret only concerns overall utility and may result in extremely low rewards at certain time steps, lacking fairness guarantees.
  2. Importance: In clinical trials and resource allocation scenarios, each time step corresponds to an independent individual (e.g., a patient). Optimizing only average utility may lead to unfair treatment of some individuals, which is unacceptable from ethical and social welfare perspectives.
  3. Limitations of Existing Methods:
    • Barman et al. (2023) proposed the Nash Confidence Bound (NCB) algorithm but relies on multiplicative Hoeffding/Chernoff bounds, requiring bounded non-negative rewards and inapplicable to general distributions like Gaussian. It implicitly requires knowledge of an upper bound on μ*.
    • Krishna et al. (2025) studied p-mean regret but requires each arm's expected reward to be at least Ω̃(√k/T^(1/4)), an extremely restrictive assumption that excludes many practical scenarios and yields suboptimal regret bounds for certain p values.
  4. Research Motivation: Develop a general framework using classical UCB algorithms to achieve fairness objectives without specialized confidence bound design, applicable to general sub-Gaussian distributions without unrealistic assumptions on unknown means.

Core Contributions

  1. Reduction Framework: Proposes reducing Nash/p-mean regret minimization to brief data-adaptive uniform exploration followed by standard bandit algorithms (e.g., UCB), with the key innovation being a data-adaptive stopping rule.
  2. Algorithm Design: Designs the Welfarist-UCB algorithm with a two-phase strategy:
    • Phase I: Uniform exploration until satisfying adaptive termination conditions
    • Phase II: Exploration-exploitation using standard UCB indices
  3. Theoretical Guarantees:
    • For Nash regret (p=0): Achieves Õ(σ√(k/T)) order-optimal bounds applicable to σ-sub-Gaussian rewards
    • For p∈0,1: Achieves Õ(σ√(k/T)) order-optimal bounds
    • For p∈[-1,0): Achieves Õ(k/√T), improving upon Krishna et al.'s Õ(k^(3/4)T^(-1/4))
    • For p<-1: Achieves Õ(|p|k^((|p|+1)/2)/√T), strictly superior to prior work in practically relevant T ranges
  4. Technical Innovation: Uses additive Hoeffding bounds rather than multiplicative bounds, avoiding strict restrictions on reward distributions and eliminating the need for upper bounds on μ*.
  5. Experimental Validation: Numerical simulations verify algorithm effectiveness across different p values, demonstrating the "no-free-lunch" principle: stronger fairness requirements lead to higher regret.

Methodology Details

Task Definition

Input:

  • k arms with reward distributions ρᵢ for arm i being σ-sub-Gaussian with mean μᵢ≥0 (unknown)
  • Time horizon T
  • Fairness parameter p∈(-∞,1]

Output: Arm Iₜ selected at each round t∈T

Objective: Minimize p-mean regret RTp:=μ(1Tt=1T(E[μIt])p)1/pR^p_T := \mu^* - \left(\frac{1}{T}\sum_{t=1}^T (\mathbb{E}[\mu_{I_t}])^p\right)^{1/p} where μ*=maxᵢμᵢ. In particular, p=0 corresponds to Nash regret (geometric mean).

Model Architecture

Phase I: Data-Adaptive Uniform Exploration

Exploration Strategy:

  • Each k steps form a block
  • At the start of each block, uniformly randomly sample a permutation π∈Πₖ of arms
  • Pull arms in order π(1), π(2), ..., π(k)

Termination Condition: Phase I terminates when some arm i satisfies both:

  1. μ^i>22σ2logTni\hat{\mu}_i > 2\sqrt{\frac{2\sigma^2\log T}{n_i}}
  2. ni>192pa2σ2logT(μ^i22σ2logTni)2n_i > \frac{192p_a^2\sigma^2\log T}{(\hat{\mu}_i - 2\sqrt{\frac{2\sigma^2\log T}{n_i}})^2}

where:

  • nᵢ: number of times arm i has been pulled
  • μ̂ᵢ: empirical mean of observed rewards from arm i
  • pₐ = 1 (if p≥-1), pₐ = p (if p<-1)

Key Property (Lemma 4.1): Under high-probability event E, Phase I duration τ satisfies 32kSτ128kS32kS \leq \tau \leq 128kS where S=4pa2σ2logT(μ)2S = \frac{4p_a^2\sigma^2\log T}{(\mu^*)^2}

This means each arm is pulled Θ(1/(μ*)²) times, which is crucial for subsequent analysis.

Phase II: UCB Exploration-Exploitation

Uses standard UCB index: UCBi=μ^i+22σ2logTni\text{UCB}_i = \hat{\mu}_i + 2\sqrt{\frac{2\sigma^2\log T}{n_i}}

At each round, select the arm with maximum UCB value.

Technical Innovations

1. Design of Data-Adaptive Termination Condition

Innovation: The denominator (μ^i22σ2logTni)2(\hat{\mu}_i - 2\sqrt{\frac{2\sigma^2\log T}{n_i}})^2 in the termination condition ensures Phase I terminates after Θ(1/(μ*)²) rounds, rather than fixed Θ(√T) or Θ(1/μ*) rounds.

Rationale:

  • When μ* is large, fewer exploration rounds are needed to identify good arms
  • When μ* is small, more exploration is needed to distinguish between arms
  • This adaptivity ensures that for any pulled arm i in Phase II, ξi:=22σ2logTμni1/2\xi_i := \frac{2\sqrt{2\sigma^2\log T}}{\mu^*\sqrt{n_i}} \leq 1/2, which is key to controlling Nash regret

2. Using Additive Hoeffding Bounds

Comparison with NCB:

  • NCB: Conditions on event {i,μi[μ^i4μ^ilogTni,μ^i+4μ^ilogTni]}\{\forall i, \mu_i \in [\hat{\mu}_i - 4\sqrt{\frac{\hat{\mu}_i\log T}{n_i}}, \hat{\mu}_i + 4\sqrt{\frac{\hat{\mu}_i\log T}{n_i}}]\}, requiring multiplicative Hoeffding bounds
  • This paper: Conditions on event {i,μi[μ^i22σ2logTni,μ^i+22σ2logTni]}\{\forall i, \mu_i \in [\hat{\mu}_i - 2\sqrt{\frac{2\sigma^2\log T}{n_i}}, \hat{\mu}_i + 2\sqrt{\frac{2\sigma^2\log T}{n_i}}]\}, requiring only additive Hoeffding bounds

Advantages:

  • Additive bounds apply to any sub-Gaussian distribution (including Gaussian, bounded, etc.)
  • No requirement for strictly non-negative or bounded rewards
  • No need to pre-know upper bounds on μ*

3. Permutation Sampling Equivalence (Lemma B.3)

Key Observation: Phase I's permutation sampling strategy is marginally equivalent to independent uniform sampling at each round, i.e., PrIₜ=i=1/k, thus 𝔼μ_{Iₜ}≥μ*/k.

Technical Significance: This static coupling simplifies Phase I analysis, allowing direct use of uniform sampling properties.

Experimental Setup

Datasets

This paper uses synthetic data for numerical simulations:

  1. Bernoulli Arms: k=50 arms with means uniformly sampled from 0.005, 1
  2. Gaussian Arms: k=50 arms with means uniformly sampled from 10, 1000, standard deviation σ=20

Evaluation Metrics

  • Nash Regret (p=0): μ(t=1TE[μIt])1/T\mu^* - (\prod_{t=1}^T \mathbb{E}[\mu_{I_t}])^{1/T}
  • p-mean Regret: μ(1Tt=1T(E[μIt])p)1/p\mu^* - (\frac{1}{T}\sum_{t=1}^T (\mathbb{E}[\mu_{I_t}])^p)^{1/p}

Estimate 𝔼μ_{Iₜ} through average rewards over 50 runs.

Baseline Methods

  1. NCB (Barman et al., 2023): Using Nash Confidence Bound index
  2. Explore-then-UCB (Krishna et al., 2025): UCB after fixed exploration phase

Implementation Details

  • Time horizon T gradually increased from small to large values
  • Test different p values (p=0.5, 0, -0.5, -1.5)
  • All experiments reproducible through provided code repository

Experimental Results

Main Results

1. Nash Regret (p=0)

Bernoulli Rewards (Figure 1a):

  • Welfarist-UCB's Nash regret decreases significantly faster than NCB as T increases
  • Verifies theoretical Õ(√(k/T)) convergence rate

Sub-Gaussian Rewards (Figure 1b):

  • Welfarist-UCB effectively minimizes Nash regret even under Gaussian distribution
  • NCB performs poorly in this setting, validating the method's applicability to general distributions

2. p-mean Regret Across Three Intervals

p=0.5 (0<p≤1) (Figure 1c):

  • Welfarist-UCB outperforms Explore-then-UCB across all T values
  • Regret decrease rate matches theoretical prediction of Õ(√(k/T))

p=-0.5 (-1<p<0) (Figure 1d):

  • Welfarist-UCB significantly outperforms Explore-then-UCB
  • Validates Õ(k/√T) bound improvement over Krishna et al.'s Õ(k^(3/4)T^(-1/4))

p=-1.5 (p≤-1) (Figure 1e):

  • Stricter fairness requirements lead to higher regret
  • Welfarist-UCB still outperforms baseline methods

3. Ablation Study: Effect of p Value (Figure 1f)

Key Findings:

  • As p decreases (stronger fairness requirements), p-mean regret continuously increases
  • Validates "no-free-lunch" hypothesis: stronger fairness comes at higher regret cost
  • As p→-∞ (Rawlsian fairness), regret bounds become vacuous, indicating extreme fairness is unattainable in short term

Experimental Findings

  1. Practical Effectiveness: Welfarist-UCB demonstrates superior practical performance across all tested scenarios
  2. Distribution Robustness: Algorithm works effectively for different reward distributions (Bernoulli, Gaussian), validating broad theoretical applicability
  3. Fairness-Utility Tradeoff: Experiments clearly demonstrate the tradeoff relationship between fairness parameter p and regret, guiding practical p selection
  4. Convergence Rate: Across all p intervals, Welfarist-UCB's regret decrease rate matches theoretical predictions and outperforms existing methods

1. Fairness in Multi-Armed Bandits

Different Fairness Concepts:

  • Arm Fairness (Joseph et al. 2016; Patil et al. 2021): Ensuring different arms are treated fairly
  • Multi-Agent Fairness (Hossain et al. 2021; Jones et al. 2023): Fair reward distribution to multiple agents per pull
  • This Paper's Focus: Cross-temporal fairness, where each time step corresponds to an independent individual

2. Nash Regret

Barman et al. (2023):

  • First proposed Nash regret concept
  • Designed NCB algorithm achieving Õ(√(k/T)) bounds
  • Limitations: Requires multiplicative concentration inequalities, restricted to bounded non-negative rewards

3. p-mean Welfare

Theoretical Foundation (Moulin 2004):

  • p-mean functions defined by five axioms: anonymity, scale invariance, continuity, monotonicity, symmetry
  • Satisfies Pigou-Dalton transfer principle (when p≤1)

Krishna et al. (2025):

  • Studied general p-mean regret
  • Used Explore-then-UCB approach
  • Limitations: Requires μᵢ≥Ω̃(√k/T^(1/4)), suboptimal bounds in certain intervals
  • Nash Regret in Linear Bandits (Sawarni et al. 2024)
  • Online Nash Social Welfare Maximization (Zhang et al. 2024)
  • Fairness in Multi-Agent Reinforcement Learning (Mandal & Gan 2022)
  • Privacy-Fairness Intersection (Sarkar et al. 2025): DP-NCB framework
  1. Weaker Assumptions: Only requires μᵢ≥0 and sub-Gaussianity, no lower bounds on μᵢ or reward boundedness
  2. Better Bounds: Achieves Õ(k/√T) for p∈[-1,0), improving upon prior Õ(k^(3/4)T^(-1/4))
  3. Unified Framework: Single algorithm handles all p values without p-specific design
  4. Technical Simplicity: Uses standard UCB and additive Hoeffding bounds, avoiding complex specialized designs

Conclusions and Discussion

Main Conclusions

  1. Theoretical Contribution: Proves that classical UCB combined with data-adaptive exploration achieves near-optimal fairness guarantees without specialized confidence bound design
  2. Practical Significance: Makes fairness-aware sequential decision-making more practical and broadly applicable, especially for scenarios requiring general distributions like Gaussian
  3. "No-Free-Lunch" Principle: Stricter fairness requirements inherently increase learning problem difficulty, requiring more samples for low regret

Limitations

  1. Assumption Constraints:
    • Still requires μᵢ≥0 (though standard, may be restrictive in some applications)
    • Sub-Gaussian assumption may not apply to heavy-tailed distributions
  2. Bounds for p<-1:
    • Regret bounds grow exponentially with |p|, becoming vacuous when T≤p²k^|p|
    • Though superior to prior work in practically relevant T ranges, room for improvement exists
  3. Missing Lower Bounds:
    • Lack matching lower bounds for p<-1 interval
    • Cannot determine if current upper bounds are tight
  4. Experimental Scale:
    • Experiments primarily conducted at k=50 (moderate scale)
    • Large-scale performance (k≫50) insufficiently explored

Future Directions

  1. Theoretical Refinement:
    • Prove matching lower bounds for p<-1, formalizing "no-free-lunch" principle
    • Explore further improvements to p<-1 bounds
  2. Assumption Relaxation:
    • Study handling of negative μᵢ cases
    • Extend to heavy-tailed or non-sub-Gaussian distributions
  3. Algorithm Extensions:
    • Generalize to contextual bandits or reinforcement learning settings
    • Combine with other fairness concepts (e.g., individual fairness)
  4. Practical Applications:
    • Validate in real clinical trials or resource allocation scenarios
    • Develop adaptive p value selection methods
  5. Computational Efficiency:
    • Phase I termination condition checking may be computationally intensive; explore more efficient implementations

In-Depth Evaluation

Strengths

  1. Strong Theoretical Innovation:
    • Data-adaptive termination condition design is clever, solving UCB's failure in Nash regret minimization
    • Using additive rather than multiplicative Hoeffding bounds is key technical breakthrough, significantly expanding applicability
    • Elegant and powerful framework handling all p values uniformly
  2. Sufficient Experimental Validation:
    • Covers multiple p intervals (p>0, p=0, -1<p<0, p<-1)
    • Compares against multiple baselines (NCB, Explore-then-UCB)
    • Includes ablation studies validating "no-free-lunch" hypothesis
  3. Convincing Results:
    • Theoretical bounds are order-optimal or near-optimal in most cases
    • Experimental results highly consistent with theoretical predictions
    • Rigorous mathematical proofs with clear technical lemmas (e.g., Lemma 4.1, 4.2)
  4. Writing Clarity:
    • Clear motivation and precise problem definition
    • Proof technique section (Section 4) provides intuitive explanations
    • Detailed and fair comparisons with prior work (Remarks 3.1, 3.2, 3.3)
  5. Reproducibility:
    • Complete code repository provided
    • Clear algorithm pseudocode (Algorithm 1)
    • Detailed experimental setup description

Weaknesses

  1. Method Limitations:
    • μᵢ≥0 assumption restrictive in some applications (though Remark 2.1 provides reasonable justification)
    • Phase I termination condition involves p² term; very large |p| may cause excessively long exploration
    • Bounds for p<-1 less tight than other intervals
  2. Experimental Setup Deficiencies:
    • Only synthetic data; lacks real dataset validation
    • k=50 scale relatively small; large-scale performance (k=1000+) unknown
    • Unexplored impact of σ value variations on algorithm performance
  3. Insufficient Analysis:
    • Missing experimental statistics on actual Phase I termination time (theoretical bounds are 32kS to 128kS; what is actual distribution?)
    • Computational complexity not discussed (especially termination condition checking)
    • Insufficient analysis of algorithm behavior under different μ* values
  4. Theoretical Gaps:
    • Missing lower bounds for p<-1, cannot judge upper bound tightness
    • Unexplored how to select σ² when μ* unknown (algorithm requires σ² as input)
    • Necessity of log k factor in Nash regret bounds insufficiently discussed

Impact

  1. Field Contribution:
    • Significantly advances theoretical understanding of fairness-aware bandit algorithms
    • Proves classical algorithms (UCB) applicable to new problems, encouraging re-examination of simple methods
    • Provides solid theoretical foundation for p-mean welfare applications in online learning
  2. Practical Value:
    • Simple algorithm, easy to implement and deploy
    • Applicable to broad distribution types, enhancing practical potential
    • Provides quantitative fairness-utility tradeoff characterization, guiding practical p selection
  3. Reproducibility:
    • Open-source code, clear algorithm description
    • Complete theoretical proofs; technical lemmas available for future work reference
    • Standardized experimental setup, easily reproducible and extensible
  4. Inspirational Value:
    • "No-free-lunch" principle discovery provides profound insight
    • Data-adaptive exploration ideas may inspire research on other bandit variants
    • Additive vs. multiplicative concentration inequality discussion has methodological significance

Applicable Scenarios

  1. Clinical Trials: Exploring effective treatments while ensuring fair patient treatment
  2. Resource Allocation: Online resource distribution (computing, ad slots) to different users, balancing efficiency and fairness
  3. Recommendation Systems: Recommending content to users at different times, avoiding extremely poor experiences for certain time periods
  4. A/B Testing: Product testing considering participant welfare, not just average metrics
  5. Educational Resource Allocation: Distributing educational resources to students across cohorts, ensuring cross-cohort fairness

Inapplicable Scenarios:

  • Applications requiring extreme Rawlsian fairness (p→-∞) in short term (bounds become vacuous)
  • Heavy-tailed or non-sub-Gaussian reward distributions
  • Applications where μᵢ may be significantly negative

Selected References

  1. Barman et al. (2023): "Fairness and welfare quantification for regret in multi-armed bandits" - First proposed Nash regret concept
  2. Krishna et al. (2025): "p-mean regret for stochastic bandits" - Studied general p-mean regret
  3. Bubeck et al. (2012): "Regret analysis of stochastic and nonstochastic multi-armed bandit problems" - Classical MAB textbook
  4. Moulin (2004): "Fair division and collective welfare" - Axiomatic foundation of p-mean welfare
  5. Sawarni et al. (2024): "Nash regret guarantees for linear bandits" - Nash regret in linear bandits

Overall Assessment: This is a high-quality theoretical machine learning paper making important contributions to fairness-aware bandit algorithms. Through clever algorithm design and rigorous theoretical analysis, it proves simple methods (UCB) effective for complex objectives (fairness). Strong theoretical results with sufficient experimental validation significantly advance the field. Main weaknesses are lack of real data validation and certain theoretical gaps (e.g., p<-1 lower bounds). Recommend future work focus on practical application validation and theoretical refinement.