2025-11-24T01:22:17.846878

Revisiting Social Welfare in Bandits: UCB is (Nearly) All You Need

Sarkar, Pandey, Chowdhury

Regret in stochastic multi-armed bandits traditionally measures the difference between the highest reward and either the arithmetic mean of accumulated rewards or the final reward. These conventional metrics often fail to address fairness among agents receiving rewards, particularly in settings where rewards are distributed across a population, such as patients in clinical trials. To address this, a recent body of work has introduced Nash regret, which evaluates performance via the geometric mean of accumulated rewards, aligning with the Nash social welfare function known for satisfying fairness axioms. To minimize Nash regret, existing approaches require specialized algorithm designs and strong assumptions, such as multiplicative concentration inequalities and bounded, non-negative rewards, making them unsuitable for even Gaussian reward distributions. We demonstrate that an initial uniform exploration phase followed by a standard Upper Confidence Bound (UCB) algorithm achieves near-optimal Nash regret, while relying only on additive Hoeffding bounds, and naturally extending to sub-Gaussian rewards. Furthermore, we generalize the algorithm to a broad class of fairness metrics called the $p$-mean regret, proving (nearly) optimal regret bounds uniformly across all $p$ values. This is in contrast to prior work, which made extremely restrictive assumptions on the bandit instances and even then achieved suboptimal regret bounds.

academic

Basic Information

Paper ID: 2510.21312
Title: Revisiting Social Welfare in Bandits: UCB is (Nearly) All You Need
Authors: Dhruv Sarkar (IIT Kharagpur), Nishant Pandey (IIT Kanpur), Sayak Ray Chowdhury (IIT Kanpur)
Category: cs.LG (Machine Learning)
Publication Date: October 24, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.21312
Code Link: https://github.com/NP-Hardest/UCBisAllYouNeed

Abstract

This paper investigates fairness issues in the Multi-Armed Bandits (MAB) problem. Traditional regret metrics based on arithmetic means cannot guarantee fairness across users. To address this, the community introduced Nash regret metrics based on geometric means. However, existing methods for minimizing Nash regret require specialized algorithm design and strong assumptions (such as multiplicative concentration inequalities, bounded non-negative rewards), and even fail to apply to Gaussian distributions. This paper demonstrates that through data-adaptive uniform exploration followed by standard UCB algorithms, one can achieve near-optimal Nash regret while relying only on additive Hoeffding bounds, naturally extending to sub-Gaussian rewards. Furthermore, the paper generalizes the algorithm to p-mean regret, a broad class of fairness metrics, proving (near-)optimal regret bounds across all p values without the stringent assumptions of prior work.

Research Background and Motivation

Problem Definition

Core Problem: In multi-armed bandit problems, how can one maximize cumulative rewards while ensuring fairness across time steps (corresponding to different users/patients)? Traditional average regret only concerns overall utility and may result in extremely low rewards at certain time steps, lacking fairness guarantees.
Importance: In clinical trials and resource allocation scenarios, each time step corresponds to an independent individual (e.g., a patient). Optimizing only average utility may lead to unfair treatment of some individuals, which is unacceptable from ethical and social welfare perspectives.
Limitations of Existing Methods:
- Barman et al. (2023) proposed the Nash Confidence Bound (NCB) algorithm but relies on multiplicative Hoeffding/Chernoff bounds, requiring bounded non-negative rewards and inapplicable to general distributions like Gaussian. It implicitly requires knowledge of an upper bound on μ*.
- Krishna et al. (2025) studied p-mean regret but requires each arm's expected reward to be at least Ω̃(√k/T^(1/4)), an extremely restrictive assumption that excludes many practical scenarios and yields suboptimal regret bounds for certain p values.
Research Motivation: Develop a general framework using classical UCB algorithms to achieve fairness objectives without specialized confidence bound design, applicable to general sub-Gaussian distributions without unrealistic assumptions on unknown means.

Core Contributions

Reduction Framework: Proposes reducing Nash/p-mean regret minimization to brief data-adaptive uniform exploration followed by standard bandit algorithms (e.g., UCB), with the key innovation being a data-adaptive stopping rule.
Algorithm Design: Designs the Welfarist-UCB algorithm with a two-phase strategy:
- Phase I: Uniform exploration until satisfying adaptive termination conditions
- Phase II: Exploration-exploitation using standard UCB indices
Theoretical Guarantees:
- For Nash regret (p=0): Achieves Õ(σ√(k/T)) order-optimal bounds applicable to σ-sub-Gaussian rewards
- For p∈0,1: Achieves Õ(σ√(k/T)) order-optimal bounds
- For p∈[-1,0): Achieves Õ(k/√T), improving upon Krishna et al.'s Õ(k^(3/4)T^(-1/4))
- For p<-1: Achieves Õ(|p|k^((|p|+1)/2)/√T), strictly superior to prior work in practically relevant T ranges
Technical Innovation: Uses additive Hoeffding bounds rather than multiplicative bounds, avoiding strict restrictions on reward distributions and eliminating the need for upper bounds on μ*.
Experimental Validation: Numerical simulations verify algorithm effectiveness across different p values, demonstrating the "no-free-lunch" principle: stronger fairness requirements lead to higher regret.

Methodology Details

Task Definition

Input:

k arms with reward distributions ρᵢ for arm i being σ-sub-Gaussian with mean μᵢ≥0 (unknown)
Time horizon T
Fairness parameter p∈(-∞,1]

Output: Arm Iₜ selected at each round t∈T

Objective: Minimize p-mean regret $R^p_T := \mu^* - \left(\frac{1}{T}\sum_{t=1}^T (\mathbb{E}[\mu_{I_t}])^p\right)^{1/p}$ where μ*=maxᵢμᵢ. In particular, p=0 corresponds to Nash regret (geometric mean).

Model Architecture

Phase I: Data-Adaptive Uniform Exploration

Exploration Strategy:

Each k steps form a block
At the start of each block, uniformly randomly sample a permutation π∈Πₖ of arms
Pull arms in order π(1), π(2), ..., π(k)

Termination Condition: Phase I terminates when some arm i satisfies both:

$\hat{\mu}_i > 2\sqrt{\frac{2\sigma^2\log T}{n_i}}$
$n_i > \frac{192p_a^2\sigma^2\log T}{(\hat{\mu}_i - 2\sqrt{\frac{2\sigma^2\log T}{n_i}})^2}$

where:

nᵢ: number of times arm i has been pulled
μ̂ᵢ: empirical mean of observed rewards from arm i
pₐ = 1 (if p≥-1), pₐ = p (if p<-1)

Key Property (Lemma 4.1): Under high-probability event E, Phase I duration τ satisfies $32kS \leq \tau \leq 128kS$ where $S = \frac{4p_a^2\sigma^2\log T}{(\mu^*)^2}$

This means each arm is pulled Θ(1/(μ*)²) times, which is crucial for subsequent analysis.

Phase II: UCB Exploration-Exploitation

Uses standard UCB index: $\text{UCB}_i = \hat{\mu}_i + 2\sqrt{\frac{2\sigma^2\log T}{n_i}}$

At each round, select the arm with maximum UCB value.

Technical Innovations

1. Design of Data-Adaptive Termination Condition

Innovation: The denominator $(\hat{\mu}_i - 2\sqrt{\frac{2\sigma^2\log T}{n_i}})^2$ in the termination condition ensures Phase I terminates after Θ(1/(μ*)²) rounds, rather than fixed Θ(√T) or Θ(1/μ*) rounds.

Rationale:

When μ* is large, fewer exploration rounds are needed to identify good arms
When μ* is small, more exploration is needed to distinguish between arms
This adaptivity ensures that for any pulled arm i in Phase II, $\xi_i := \frac{2\sqrt{2\sigma^2\log T}}{\mu^*\sqrt{n_i}} \leq 1/2$ , which is key to controlling Nash regret

2. Using Additive Hoeffding Bounds

Comparison with NCB:

NCB: Conditions on event $\{\forall i, \mu_i \in [\hat{\mu}_i - 4\sqrt{\frac{\hat{\mu}_i\log T}{n_i}}, \hat{\mu}_i + 4\sqrt{\frac{\hat{\mu}_i\log T}{n_i}}]\}$ , requiring multiplicative Hoeffding bounds
This paper: Conditions on event $\{\forall i, \mu_i \in [\hat{\mu}_i - 2\sqrt{\frac{2\sigma^2\log T}{n_i}}, \hat{\mu}_i + 2\sqrt{\frac{2\sigma^2\log T}{n_i}}]\}$ , requiring only additive Hoeffding bounds

Advantages:

Additive bounds apply to any sub-Gaussian distribution (including Gaussian, bounded, etc.)
No requirement for strictly non-negative or bounded rewards
No need to pre-know upper bounds on μ*

3. Permutation Sampling Equivalence (Lemma B.3)

Key Observation: Phase I's permutation sampling strategy is marginally equivalent to independent uniform sampling at each round, i.e., PrIₜ=i=1/k, thus 𝔼μ_{Iₜ}≥μ*/k.

Technical Significance: This static coupling simplifies Phase I analysis, allowing direct use of uniform sampling properties.

Experimental Setup

Datasets

This paper uses synthetic data for numerical simulations:

Bernoulli Arms: k=50 arms with means uniformly sampled from 0.005, 1
Gaussian Arms: k=50 arms with means uniformly sampled from 10, 1000, standard deviation σ=20

Evaluation Metrics

Nash Regret (p=0): $\mu^* - (\prod_{t=1}^T \mathbb{E}[\mu_{I_t}])^{1/T}$
p-mean Regret: $\mu^* - (\frac{1}{T}\sum_{t=1}^T (\mathbb{E}[\mu_{I_t}])^p)^{1/p}$

Estimate 𝔼μ_{Iₜ} through average rewards over 50 runs.

Baseline Methods

NCB (Barman et al., 2023): Using Nash Confidence Bound index
Explore-then-UCB (Krishna et al., 2025): UCB after fixed exploration phase

Implementation Details

Time horizon T gradually increased from small to large values
Test different p values (p=0.5, 0, -0.5, -1.5)
All experiments reproducible through provided code repository

Experimental Results

Main Results

1. Nash Regret (p=0)

Bernoulli Rewards (Figure 1a):

Welfarist-UCB's Nash regret decreases significantly faster than NCB as T increases
Verifies theoretical Õ(√(k/T)) convergence rate

Sub-Gaussian Rewards (Figure 1b):

Welfarist-UCB effectively minimizes Nash regret even under Gaussian distribution
NCB performs poorly in this setting, validating the method's applicability to general distributions

2. p-mean Regret Across Three Intervals

p=0.5 (0<p≤1) (Figure 1c):

Welfarist-UCB outperforms Explore-then-UCB across all T values
Regret decrease rate matches theoretical prediction of Õ(√(k/T))

p=-0.5 (-1<p<0) (Figure 1d):

Welfarist-UCB significantly outperforms Explore-then-UCB
Validates Õ(k/√T) bound improvement over Krishna et al.'s Õ(k^(3/4)T^(-1/4))

p=-1.5 (p≤-1) (Figure 1e):

Stricter fairness requirements lead to higher regret
Welfarist-UCB still outperforms baseline methods

3. Ablation Study: Effect of p Value (Figure 1f)

Key Findings:

As p decreases (stronger fairness requirements), p-mean regret continuously increases
Validates "no-free-lunch" hypothesis: stronger fairness comes at higher regret cost
As p→-∞ (Rawlsian fairness), regret bounds become vacuous, indicating extreme fairness is unattainable in short term

Experimental Findings

Practical Effectiveness: Welfarist-UCB demonstrates superior practical performance across all tested scenarios
Distribution Robustness: Algorithm works effectively for different reward distributions (Bernoulli, Gaussian), validating broad theoretical applicability
Fairness-Utility Tradeoff: Experiments clearly demonstrate the tradeoff relationship between fairness parameter p and regret, guiding practical p selection
Convergence Rate: Across all p intervals, Welfarist-UCB's regret decrease rate matches theoretical predictions and outperforms existing methods

1. Fairness in Multi-Armed Bandits

Different Fairness Concepts:

Arm Fairness (Joseph et al. 2016; Patil et al. 2021): Ensuring different arms are treated fairly
Multi-Agent Fairness (Hossain et al. 2021; Jones et al. 2023): Fair reward distribution to multiple agents per pull
This Paper's Focus: Cross-temporal fairness, where each time step corresponds to an independent individual

2. Nash Regret

Barman et al. (2023):

First proposed Nash regret concept
Designed NCB algorithm achieving Õ(√(k/T)) bounds
Limitations: Requires multiplicative concentration inequalities, restricted to bounded non-negative rewards

3. p-mean Welfare

Theoretical Foundation (Moulin 2004):

p-mean functions defined by five axioms: anonymity, scale invariance, continuity, monotonicity, symmetry
Satisfies Pigou-Dalton transfer principle (when p≤1)

Krishna et al. (2025):

Studied general p-mean regret
Used Explore-then-UCB approach
Limitations: Requires μᵢ≥Ω̃(√k/T^(1/4)), suboptimal bounds in certain intervals

Nash Regret in Linear Bandits (Sawarni et al. 2024)
Online Nash Social Welfare Maximization (Zhang et al. 2024)
Fairness in Multi-Agent Reinforcement Learning (Mandal & Gan 2022)
Privacy-Fairness Intersection (Sarkar et al. 2025): DP-NCB framework

Weaker Assumptions: Only requires μᵢ≥0 and sub-Gaussianity, no lower bounds on μᵢ or reward boundedness
Better Bounds: Achieves Õ(k/√T) for p∈[-1,0), improving upon prior Õ(k^(3/4)T^(-1/4))
Unified Framework: Single algorithm handles all p values without p-specific design
Technical Simplicity: Uses standard UCB and additive Hoeffding bounds, avoiding complex specialized designs

Conclusions and Discussion

Main Conclusions

Theoretical Contribution: Proves that classical UCB combined with data-adaptive exploration achieves near-optimal fairness guarantees without specialized confidence bound design
Practical Significance: Makes fairness-aware sequential decision-making more practical and broadly applicable, especially for scenarios requiring general distributions like Gaussian
"No-Free-Lunch" Principle: Stricter fairness requirements inherently increase learning problem difficulty, requiring more samples for low regret

Limitations

Assumption Constraints:
- Still requires μᵢ≥0 (though standard, may be restrictive in some applications)
- Sub-Gaussian assumption may not apply to heavy-tailed distributions
Bounds for p<-1:
- Regret bounds grow exponentially with |p|, becoming vacuous when T≤p²k^|p|
- Though superior to prior work in practically relevant T ranges, room for improvement exists
Missing Lower Bounds:
- Lack matching lower bounds for p<-1 interval
- Cannot determine if current upper bounds are tight
Experimental Scale:
- Experiments primarily conducted at k=50 (moderate scale)
- Large-scale performance (k≫50) insufficiently explored

Future Directions

Theoretical Refinement:
- Prove matching lower bounds for p<-1, formalizing "no-free-lunch" principle
- Explore further improvements to p<-1 bounds
Assumption Relaxation:
- Study handling of negative μᵢ cases
- Extend to heavy-tailed or non-sub-Gaussian distributions
Algorithm Extensions:
- Generalize to contextual bandits or reinforcement learning settings
- Combine with other fairness concepts (e.g., individual fairness)
Practical Applications:
- Validate in real clinical trials or resource allocation scenarios
- Develop adaptive p value selection methods
Computational Efficiency:
- Phase I termination condition checking may be computationally intensive; explore more efficient implementations

In-Depth Evaluation

Strengths

Strong Theoretical Innovation:
- Data-adaptive termination condition design is clever, solving UCB's failure in Nash regret minimization
- Using additive rather than multiplicative Hoeffding bounds is key technical breakthrough, significantly expanding applicability
- Elegant and powerful framework handling all p values uniformly
Sufficient Experimental Validation:
- Covers multiple p intervals (p>0, p=0, -1<p<0, p<-1)
- Compares against multiple baselines (NCB, Explore-then-UCB)
- Includes ablation studies validating "no-free-lunch" hypothesis
Convincing Results:
- Theoretical bounds are order-optimal or near-optimal in most cases
- Experimental results highly consistent with theoretical predictions
- Rigorous mathematical proofs with clear technical lemmas (e.g., Lemma 4.1, 4.2)
Writing Clarity:
- Clear motivation and precise problem definition
- Proof technique section (Section 4) provides intuitive explanations
- Detailed and fair comparisons with prior work (Remarks 3.1, 3.2, 3.3)
Reproducibility:
- Complete code repository provided
- Clear algorithm pseudocode (Algorithm 1)
- Detailed experimental setup description

Weaknesses

Method Limitations:
- μᵢ≥0 assumption restrictive in some applications (though Remark 2.1 provides reasonable justification)
- Phase I termination condition involves p² term; very large |p| may cause excessively long exploration
- Bounds for p<-1 less tight than other intervals
Experimental Setup Deficiencies:
- Only synthetic data; lacks real dataset validation
- k=50 scale relatively small; large-scale performance (k=1000+) unknown
- Unexplored impact of σ value variations on algorithm performance
Insufficient Analysis:
- Missing experimental statistics on actual Phase I termination time (theoretical bounds are 32kS to 128kS; what is actual distribution?)
- Computational complexity not discussed (especially termination condition checking)
- Insufficient analysis of algorithm behavior under different μ* values
Theoretical Gaps:
- Missing lower bounds for p<-1, cannot judge upper bound tightness
- Unexplored how to select σ² when μ* unknown (algorithm requires σ² as input)
- Necessity of log k factor in Nash regret bounds insufficiently discussed

Impact

Field Contribution:
- Significantly advances theoretical understanding of fairness-aware bandit algorithms
- Proves classical algorithms (UCB) applicable to new problems, encouraging re-examination of simple methods
- Provides solid theoretical foundation for p-mean welfare applications in online learning
Practical Value:
- Simple algorithm, easy to implement and deploy
- Applicable to broad distribution types, enhancing practical potential
- Provides quantitative fairness-utility tradeoff characterization, guiding practical p selection
Reproducibility:
- Open-source code, clear algorithm description
- Complete theoretical proofs; technical lemmas available for future work reference
- Standardized experimental setup, easily reproducible and extensible
Inspirational Value:
- "No-free-lunch" principle discovery provides profound insight
- Data-adaptive exploration ideas may inspire research on other bandit variants
- Additive vs. multiplicative concentration inequality discussion has methodological significance

Applicable Scenarios

Clinical Trials: Exploring effective treatments while ensuring fair patient treatment
Resource Allocation: Online resource distribution (computing, ad slots) to different users, balancing efficiency and fairness
Recommendation Systems: Recommending content to users at different times, avoiding extremely poor experiences for certain time periods
A/B Testing: Product testing considering participant welfare, not just average metrics
Educational Resource Allocation: Distributing educational resources to students across cohorts, ensuring cross-cohort fairness

Inapplicable Scenarios:

Applications requiring extreme Rawlsian fairness (p→-∞) in short term (bounds become vacuous)
Heavy-tailed or non-sub-Gaussian reward distributions
Applications where μᵢ may be significantly negative

Selected References

Barman et al. (2023): "Fairness and welfare quantification for regret in multi-armed bandits" - First proposed Nash regret concept
Krishna et al. (2025): "p-mean regret for stochastic bandits" - Studied general p-mean regret
Bubeck et al. (2012): "Regret analysis of stochastic and nonstochastic multi-armed bandit problems" - Classical MAB textbook
Moulin (2004): "Fair division and collective welfare" - Axiomatic foundation of p-mean welfare
Sawarni et al. (2024): "Nash regret guarantees for linear bandits" - Nash regret in linear bandits

Overall Assessment: This is a high-quality theoretical machine learning paper making important contributions to fairness-aware bandit algorithms. Through clever algorithm design and rigorous theoretical analysis, it proves simple methods (UCB) effective for complex objectives (fairness). Strong theoretical results with sufficient experimental validation significantly advance the field. Main weaknesses are lack of real data validation and certain theoretical gaps (e.g., p<-1 lower bounds). Recommend future work focus on practical application validation and theoretical refinement.