Revisiting Social Welfare in Bandits: UCB is (Nearly) All You Need
Sarkar, Pandey, Chowdhury
Regret in stochastic multi-armed bandits traditionally measures the difference between the highest reward and either the arithmetic mean of accumulated rewards or the final reward. These conventional metrics often fail to address fairness among agents receiving rewards, particularly in settings where rewards are distributed across a population, such as patients in clinical trials. To address this, a recent body of work has introduced Nash regret, which evaluates performance via the geometric mean of accumulated rewards, aligning with the Nash social welfare function known for satisfying fairness axioms.
To minimize Nash regret, existing approaches require specialized algorithm designs and strong assumptions, such as multiplicative concentration inequalities and bounded, non-negative rewards, making them unsuitable for even Gaussian reward distributions. We demonstrate that an initial uniform exploration phase followed by a standard Upper Confidence Bound (UCB) algorithm achieves near-optimal Nash regret, while relying only on additive Hoeffding bounds, and naturally extending to sub-Gaussian rewards. Furthermore, we generalize the algorithm to a broad class of fairness metrics called the $p$-mean regret, proving (nearly) optimal regret bounds uniformly across all $p$ values. This is in contrast to prior work, which made extremely restrictive assumptions on the bandit instances and even then achieved suboptimal regret bounds.
academic
Revisiting Social Welfare in Bandits: UCB is (Nearly) All You Need
This paper investigates fairness issues in the Multi-Armed Bandits (MAB) problem. Traditional regret metrics based on arithmetic means cannot guarantee fairness across users. To address this, the community introduced Nash regret metrics based on geometric means. However, existing methods for minimizing Nash regret require specialized algorithm design and strong assumptions (such as multiplicative concentration inequalities, bounded non-negative rewards), and even fail to apply to Gaussian distributions. This paper demonstrates that through data-adaptive uniform exploration followed by standard UCB algorithms, one can achieve near-optimal Nash regret while relying only on additive Hoeffding bounds, naturally extending to sub-Gaussian rewards. Furthermore, the paper generalizes the algorithm to p-mean regret, a broad class of fairness metrics, proving (near-)optimal regret bounds across all p values without the stringent assumptions of prior work.
Core Problem: In multi-armed bandit problems, how can one maximize cumulative rewards while ensuring fairness across time steps (corresponding to different users/patients)? Traditional average regret only concerns overall utility and may result in extremely low rewards at certain time steps, lacking fairness guarantees.
Importance: In clinical trials and resource allocation scenarios, each time step corresponds to an independent individual (e.g., a patient). Optimizing only average utility may lead to unfair treatment of some individuals, which is unacceptable from ethical and social welfare perspectives.
Limitations of Existing Methods:
Barman et al. (2023) proposed the Nash Confidence Bound (NCB) algorithm but relies on multiplicative Hoeffding/Chernoff bounds, requiring bounded non-negative rewards and inapplicable to general distributions like Gaussian. It implicitly requires knowledge of an upper bound on μ*.
Krishna et al. (2025) studied p-mean regret but requires each arm's expected reward to be at least Ω̃(√k/T^(1/4)), an extremely restrictive assumption that excludes many practical scenarios and yields suboptimal regret bounds for certain p values.
Research Motivation: Develop a general framework using classical UCB algorithms to achieve fairness objectives without specialized confidence bound design, applicable to general sub-Gaussian distributions without unrealistic assumptions on unknown means.
Reduction Framework: Proposes reducing Nash/p-mean regret minimization to brief data-adaptive uniform exploration followed by standard bandit algorithms (e.g., UCB), with the key innovation being a data-adaptive stopping rule.
Algorithm Design: Designs the Welfarist-UCB algorithm with a two-phase strategy:
Phase I: Uniform exploration until satisfying adaptive termination conditions
Phase II: Exploration-exploitation using standard UCB indices
Theoretical Guarantees:
For Nash regret (p=0): Achieves Õ(σ√(k/T)) order-optimal bounds applicable to σ-sub-Gaussian rewards
For p∈0,1: Achieves Õ(σ√(k/T)) order-optimal bounds
For p∈[-1,0): Achieves Õ(k/√T), improving upon Krishna et al.'s Õ(k^(3/4)T^(-1/4))
For p<-1: Achieves Õ(|p|k^((|p|+1)/2)/√T), strictly superior to prior work in practically relevant T ranges
Technical Innovation: Uses additive Hoeffding bounds rather than multiplicative bounds, avoiding strict restrictions on reward distributions and eliminating the need for upper bounds on μ*.
Experimental Validation: Numerical simulations verify algorithm effectiveness across different p values, demonstrating the "no-free-lunch" principle: stronger fairness requirements lead to higher regret.
Innovation: The denominator (μ^i−2ni2σ2logT)2 in the termination condition ensures Phase I terminates after Θ(1/(μ*)²) rounds, rather than fixed Θ(√T) or Θ(1/μ*) rounds.
Rationale:
When μ* is large, fewer exploration rounds are needed to identify good arms
When μ* is small, more exploration is needed to distinguish between arms
This adaptivity ensures that for any pulled arm i in Phase II, ξi:=μ∗ni22σ2logT≤1/2, which is key to controlling Nash regret
Practical Effectiveness: Welfarist-UCB demonstrates superior practical performance across all tested scenarios
Distribution Robustness: Algorithm works effectively for different reward distributions (Bernoulli, Gaussian), validating broad theoretical applicability
Fairness-Utility Tradeoff: Experiments clearly demonstrate the tradeoff relationship between fairness parameter p and regret, guiding practical p selection
Convergence Rate: Across all p intervals, Welfarist-UCB's regret decrease rate matches theoretical predictions and outperforms existing methods
Theoretical Contribution: Proves that classical UCB combined with data-adaptive exploration achieves near-optimal fairness guarantees without specialized confidence bound design
Practical Significance: Makes fairness-aware sequential decision-making more practical and broadly applicable, especially for scenarios requiring general distributions like Gaussian
"No-Free-Lunch" Principle: Stricter fairness requirements inherently increase learning problem difficulty, requiring more samples for low regret
Barman et al. (2023): "Fairness and welfare quantification for regret in multi-armed bandits" - First proposed Nash regret concept
Krishna et al. (2025): "p-mean regret for stochastic bandits" - Studied general p-mean regret
Bubeck et al. (2012): "Regret analysis of stochastic and nonstochastic multi-armed bandit problems" - Classical MAB textbook
Moulin (2004): "Fair division and collective welfare" - Axiomatic foundation of p-mean welfare
Sawarni et al. (2024): "Nash regret guarantees for linear bandits" - Nash regret in linear bandits
Overall Assessment: This is a high-quality theoretical machine learning paper making important contributions to fairness-aware bandit algorithms. Through clever algorithm design and rigorous theoretical analysis, it proves simple methods (UCB) effective for complex objectives (fairness). Strong theoretical results with sufficient experimental validation significantly advance the field. Main weaknesses are lack of real data validation and certain theoretical gaps (e.g., p<-1 lower bounds). Recommend future work focus on practical application validation and theoretical refinement.