2025-11-27T07:22:18.939551

Fair Algorithms with Probing for Multi-Agent Multi-Armed Bandits

Xu, Liu, Mattei et al.
We propose a multi-agent multi-armed bandit (MA-MAB) framework aimed at ensuring fair outcomes across agents while maximizing overall system performance. A key challenge in this setting is decision-making under limited information about arm rewards. To address this, we introduce a novel probing framework that strategically gathers information about selected arms before allocation. In the offline setting, where reward distributions are known, we leverage submodular properties to design a greedy probing algorithm with a provable performance bound. For the more complex online setting, we develop an algorithm that achieves sublinear regret while maintaining fairness. Extensive experiments on synthetic and real-world datasets show that our approach outperforms baseline methods, achieving better fairness and efficiency.
academic

Fair Algorithms with Probing for Multi-Agent Multi-Armed Bandits

Basic Information

  • Paper ID: 2506.14988
  • Title: Fair Algorithms with Probing for Multi-Agent Multi-Armed Bandits
  • Authors: Tianyi Xu, Jiaxin Liu, Nicholas Mattei, Zizhan Zheng
  • Institutions: Tulane University, University of Illinois Urbana-Champaign
  • Classification: cs.LG, cs.AI
  • Publication Date: arXiv preprint, version from November 26, 2025
  • Paper Link: https://arxiv.org/abs/2506.14988v4

Abstract

This paper proposes a multi-agent multi-armed bandit (MA-MAB) framework designed to ensure fairness among agents while maximizing overall system performance. To address decision-making challenges with limited arm reward information, a novel probing mechanism is introduced that strategically collects information about selected arms before allocation. In the offline setting (known reward distributions), a greedy probing algorithm with constant-factor approximation guarantees is designed by leveraging submodularity. In the online setting, an algorithm is developed that achieves sublinear regret while maintaining fairness. Extensive experiments on synthetic and real-world datasets demonstrate that the approach outperforms baseline methods in both fairness and efficiency.

Research Background and Motivation

Core Problem

Traditional multi-armed bandit problems typically optimize utilitarian welfare (i.e., sum of total rewards), but this leads to severe fairness issues. For example, in ride-sharing scenarios, when a central dispatch system allocates drivers to different geographic regions, optimizing total revenue may result in some drivers receiving no orders at all, causing "starvation" phenomena.

Problem Importance

Fair resource allocation is crucial in many practical applications:

  1. Ride-sharing Platforms: Drivers need fair access to profitable regions
  2. Content Recommendation Systems: Creator exposure should not be monopolized
  3. Wireless Network Scheduling: Client devices require fair bandwidth allocation

Limitations of Existing Methods

  1. Fairness Deficiency: Existing MA-MAB methods primarily focus on total reward maximization, neglecting individual fairness
  2. Information Dependency: Relies on immediate feedback to update estimates, performing poorly in incomplete information or noisy environments
  3. Insufficient Exploration: Lacks active information collection mechanisms, leading to error propagation from uncertain arms to allocation decisions

Research Motivation

This paper bridges the gap between fair multi-agent approaches and active exploration strategies by introducing a probing mechanism and Nash Social Welfare (NSW) objectives, enabling information acquired through active exploration to directly translate into fair outcomes for all agents.

Core Contributions

  1. Novel Framework: Extends the MA-MAB framework by introducing a probing mechanism to test selected arms before allocation, ensuring fairness through NSW optimization
  2. Offline Algorithm: Develops a simple yet effective greedy probing strategy with provable performance guarantees for known reward patterns
  3. Online Algorithm: Proposes an algorithm balancing exploration and fairness, proving that probing and fair allocation do not compromise asymptotic performance (sublinear regret)
  4. Experimental Validation: Demonstrates superiority over baselines in both fairness and efficiency on synthetic and real data

Methodology Details

Task Definition

Basic Setup:

  • Agents: M agents, denoted j ∈ M
  • Arms: A arms, denoted a ∈ A
  • Reward Distributions: Each agent-arm pair (j,a) has an unknown distribution D_{j,a} with mean μ_{j,a} ∈ 0,1

Decision Process (each round t):

  1. Probing Phase: Select probing set S_t ⊆ A, obtaining fresh reward samples R_{j,a,t} for each a ∈ S_t
  2. Allocation Phase: Based on observed rewards and current estimates, allocate arms to agents via policy π_t

Fairness Objective: Maximize Nash Social Welfare (NSW) NSW(St,Rt,μ,πt)=j[M](aStπj,a,tRj,a,t+aStπj,a,tμj,a)\text{NSW}(S_t, R_t, \mu, \pi_t) = \prod_{j \in [M]} \left( \sum_{a \in S_t} \pi_{j,a,t} R_{j,a,t} + \sum_{a \notin S_t} \pi_{j,a,t} \mu_{j,a} \right)

Probing Cost: Effective reward is Rttotal=(1α(St))ERt[NSW(St,Rt,μ,πt)]R^{\text{total}}_t = (1 - \alpha(|S_t|)) \mathbb{E}_{R_t}[\text{NSW}(S_t, R_t, \mu, \pi_t)] where α(·) is a non-decreasing cost function with α(0)=0, α(I)=1

Regret Definition: Rregret(T)=t=1T[(1α(St))E[NSW(St,Rt,μ,πt)]Rttotal]R_{\text{regret}}(T) = \sum_{t=1}^T \left[ (1-\alpha(|S^*_t|)) \mathbb{E}[\text{NSW}(S^*_t, R_t, \mu, \pi^*_t)] - R^{\text{total}}_t \right]

Offline Setting: Greedy Probing Algorithm

Optimization Objective Decomposition

Define two utility functions:

  1. Probing Utility g(S): Optimal NSW expectation using only probed arms g(S)=maxπΔSAE[j[M](aSπj,aRj,a)]g(S) = \max_{\pi \in \Delta^A_S} \mathbb{E}\left[ \prod_{j \in [M]} \left( \sum_{a \in S} \pi_{j,a} R_{j,a} \right) \right]
  2. Non-Probing Utility h(S): Optimal NSW using only unprobed arms h(S)=maxπΔ[A]\SAj[M](aSπj,aμj,a)h(S) = \max_{\pi \in \Delta^A_{[A]\backslash S}} \prod_{j \in [M]} \left( \sum_{a \notin S} \pi_{j,a} \mu_{j,a} \right)

Piecewise Linear Upper Bound Construction

Since log(g(S)) is non-submodular, direct optimization is difficult. A piecewise linear envelope approximation is adopted:

  • Partition interval 0, x_max into fine-grained segments with breakpoints τ_0, τ_1, ..., τ_L
  • Construct tangent lines at each breakpoint: T_{τ_i}(z) = log(τ_i) + (z - τ_i)/τ_i
  • Define φ(z) = max_{0≤i<L} T_{τ_i}(z), satisfying φ(z) ≥ log(z)
  • Set f_upper(S) = φ(g(S))

Submodularity Properties

Lemmas 1-5 establish key properties:

  • Lemma 1: g(S) is monotone increasing
  • Lemma 2: f_upper(S) is monotone increasing
  • Lemma 3: f_upper(S) is submodular (core property)
  • Lemma 4: h(S) is monotone decreasing
  • Lemma 5: R(S) ≤ (1-α(|S|))(g(S) + h(S))

Algorithm 1: Offline Greedy Probing

Input: Distributions {F_{m,a}}, cost function α(·), budget I, parameter ζ≥1
Output: Probing set S_pr

1. Initialize S_0 = ∅
2. For i = 1 to I:
   - Select arm with maximum marginal gain:
     a = argmax_{a∉S_{i-1}} [f_upper(S_{i-1}∪{a}) - f_upper(S_{i-1})]
   - S_i = S_{i-1} ∪ {a}
3. Sort candidate sets by (1-α(|S_j|))f_upper(S_j) in descending order
4. Iterate through sorted sets:
   - If (1-α(|S_j|))f_upper(S_j) < h(∅), return ∅
   - If (1-α(|S_j|))f_upper(S_j) > ζR(S_j), continue
   - Otherwise return S_j

Theorem 1 (Approximation Guarantee): For any ζ≥1, the returned S_pr satisfies R(Spr)e12e11ζR(S)R(S_{pr}) \geq \frac{e-1}{2e-1} \cdot \frac{1}{\zeta} \cdot R(S^*) This is derived from the (1-1/e) approximation factor for submodular maximization.

Online Setting: OFMUP Algorithm

Algorithm 2: Online Fair Multi-Agent UCB with Probing (OFMUP)

Initialization Phase (t = 1 to MA):

  • Each agent-arm pair is explored at least once
  • Construct empirical CDF F̂_{j,a,t} and mean estimate μ̂_{j,a,t}

Main Loop Phase (t > MA):

  1. Probing Set Selection: Call Algorithm 1 based on current estimates F̂_{j,a,t} to select S_t
  2. Probing Update: Sample arms in S_t, update statistics and confidence upper bounds Uj,a,t=min(μ^j,a,t+wj,a,t,1)U_{j,a,t} = \min(\hat{\mu}_{j,a,t} + w_{j,a,t}, 1) where w_{j,a,t} is confidence width based on Freedman's inequality
  3. Policy Optimization: πt=argmaxπtΔA(1α(St))ERt[NSW(St,Rt,Ut,πt)]\pi_t = \arg\max_{\pi_t \in \Delta^A} (1-\alpha(|S_t|)) \mathbb{E}_{R_t}[\text{NSW}(S_t, R_t, U_t, \pi_t)]
  4. Arm Pulling: Each agent j pulls arms according to π_t, observes rewards and updates

Confidence Bound Construction

Lemma 7 (Concentration Bound): With probability at least 1-δ/2, for all t>A, a∈A, j∈M: μj,aμ^j,a,t2(μ^j,a,tμ^j,a,t2)ln(2MATδ)Nj,a,t+ln(2MATδ)3Nj,a,t=wj,a,t|\mu_{j,a} - \hat{\mu}_{j,a,t}| \leq \sqrt{\frac{2(\hat{\mu}_{j,a,t} - \hat{\mu}^2_{j,a,t}) \ln(\frac{2MAT}{\delta})}{N_{j,a,t}}} + \frac{\ln(\frac{2MAT}{\delta})}{3N_{j,a,t}} = w_{j,a,t}

Technical Innovations

  1. Probing and Fairness Integration: First to combine active probing mechanisms with NSW fairness objectives, unlike prior work optimizing only total rewards
  2. Submodular Approximation Technique: Transforms non-submodular problems into submodular optimization via piecewise linear upper bounds, maintaining tractability
  3. UCB-NSW Fusion: Online algorithm cleverly combines UCB confidence bounds with NSW optimization, balancing exploration-exploitation and fairness
  4. Hierarchical Regret Analysis: Partitions rounds into "large γ" and "small γ" categories, handling high and low uncertainty cases separately

Experimental Setup

Datasets

Synthetic Data:

  • Small Scale: M=12 agents, A=8 arms
  • Medium Scale: M=20 agents, A=10 arms
  • Reward Distributions:
    • Bernoulli distribution (rewards 0 or 1, means in 0.3, 0.8)
    • Discrete distribution (rewards sampled from {0.3, 0.4, 0.5, 0.6, 0.7, 0.8})

Real Data: NYYellowTaxi 2016 dataset

  • Agents: Vehicles (randomly pre-sampled locations)
  • Arms: Discretized pickup locations (0.01°×0.01° grid)
  • Rewards: Based on normalized Manhattan distance from vehicle to pickup point (closer distance yields higher reward)

Evaluation Metrics

  • Cumulative Regret: Computed per formula (1), optimal reward determined via exhaustive search
  • Numerical Stability: Uses geometric mean to aggregate cumulative NSW of each agent's rewards
  • Approximation Verification: Verifies difference between f_upper(S) and log(g(S)) ≤ 0.03

Comparison Methods

  1. Non-Probing: Fair MAB algorithm from Jones et al. (2023), without probing capability, optimizes allocation based only on current information
  2. Random P+A: Randomly selects fixed number of arms to probe, then randomly allocates
  3. Greedy P+A: Uses same greedy probing strategy, but randomly allocates after probing

Implementation Details

Experimental Results

Main Findings

Key Observations from Figure 1:

  1. Small Scale Scenario (M=12, A=8, Bernoulli):
    • OFMUP achieves 85% lower regret than Random P+A at T=3000
    • 60% lower than Greedy P+A
    • Significantly outperforms Non-Probing
  2. Medium Scale Scenario (M=20, A=10, Bernoulli):
    • OFMUP advantage more pronounced
    • 88% lower regret than Random P+A at T=3000
    • 80% lower than Greedy P+A
    • Demonstrates better scalability
  3. Discrete Distribution Scenario:
    • OFMUP consistently outperforms baselines
    • Gap increases as learning reward patterns
    • 85% lower than Random P+A, 65% lower than Non-Probing at T=3000
  4. Random P+A Anomaly:
    • Slightly higher regret than expected in small-scale tests
    • Cause: Increased variance in small samples when computing fairness via geometric mean

Scalability Analysis

Scalability Shown in Figure 2:

  1. Fixed Arms (A=8), Varying Agents:
    • OFMUP performs well across all agent counts
    • Relative advantage grows with problem complexity
  2. Fixed Agents (M=20), Varying Arms:
    • OFMUP advantage more pronounced with increasing arms
    • Demonstrates probing mechanism's value in high-dimensional spaces

Experimental Findings

  1. Critical Role of Probing: Probing plays decisive role in information gathering and guiding allocation
  2. Importance of Fair Allocation: Greedy P+A performs far worse than OFMUP, highlighting importance of fair allocation after probing
  3. Theory Validation: Experimental results verify theoretical analysis—OFMUP effectively balances exploration-exploitation and fairness
  4. Real Data Applicability: Success on NYYellowTaxi data demonstrates practical application potential

Multi-Armed Bandit Foundations

  • Single-Agent MAB: Lai & Robbins (1985), Garivier & Cappé (2011)
  • Multi-Agent MAB: Martínez-Rubio et al. (2019), Hossain et al. (2021)
  • Fairness Research: Joseph et al. (2016, 2018), Patil et al. (2021)

Nash Social Welfare

  • Theoretical Foundations: Kaneko & Nakamura (1979)
  • Computational Methods: Cole & Gkatzelis (2015)
  • MA-MAB Applications: Jones, Nguyen & Nguyen (2023)—the work directly extended by this paper

Probing Mechanisms

  • Economic Origins: Weitzman (1979)
  • Single-Agent Bandits with Probing:
    • Aaron D. Tucker et al. (2023): Expensive reward observations, Θ(c^{1/3}T^{2/3}) bounds
    • Elumar et al. (2024): One arm probed per round, Õ(√KT) regret
    • Zuo et al. (2020): Observe-Before-Play framework
  • Submodular Probing: Bhaskara et al. (2020) routing problems
  • This Paper's Contribution: First to combine probing with multi-agent fairness

Research Gap

Existing work either addresses fair MAB relying on passive feedback or studies probing while ignoring fairness. This paper fills this gap.

Theoretical Analysis

Offline Approximation Guarantee (Theorem 1)

Result: R(S_pr) ≥ (e-1)/(2e-1) · 1/ζ · R(S*)

Proof Key Steps:

  1. Leverage Lemma 5: R(S) ≤ (1-α(|S|))(g(S) + h(S))
  2. Apply (1-1/e) approximation for submodular maximization
  3. Connect f_upper to R through algorithm filtering conditions
  4. Derive constant-factor approximation guarantee

Online Regret Bound (Theorem 2)

Result: With probability at least 1-δ, Rregret(T)=O(ζ(MAT+MA)lnc(MATδ))R_{\text{regret}}(T) = O\left(\zeta(\sqrt{MAT} + MA) \ln^c\left(\frac{MAT}{\delta}\right)\right)

Proof Framework:

  1. NSW Smoothness (Lemma 6): Lipschitz property of NSW with respect to reward matrix
  2. Concentration Bounds (Lemma 7): Confidence bounds based on Freedman's inequality
  3. Round Stratification (Lemma 8):
    • Define γ_{j,t} = Σ_a π_{j,a,t}(1-U_{j,a,t})
    • Large γ rounds: |Q(t,p)| ≥ 2^p·3lnT, contribute ≤1/T²
    • Small γ rounds: Apply concentration bounds, decompose into square-root and linear terms
  4. Square-Root Term Bound: Handle via Young's inequality and Lemma 8
  5. Linear Term Bound: Utilize Lemma 4.4 from Jones et al. (2023)
  6. Final Combination: Obtain O(√MAT) main term plus lower-order terms

Key Techniques:

  • Convert regret from (S*,π*) to (S_t,π_t), utilizing Theorem 1's factor
  • UCB optimism: U_{j,a,t}≥μ_{j,a} ensures exploration
  • Stratified analysis avoids directly handling complex dependencies across all rounds

Conclusions and Discussion

Main Conclusions

  1. Theoretical Contributions:
    • Offline: Computable algorithm with constant-factor approximation guarantee
    • Online: Sublinear regret O(√MAT), comparable to non-probing fair algorithms but superior practical performance
  2. Practical Value:
    • Probing mechanism significantly improves reward estimation
    • Balances exploration-exploitation and fairness
    • Validated effectiveness on real ride-sharing data
  3. Methodological Innovation:
    • First to combine active probing with multi-agent fairness
    • Submodular approximation technique for non-convex optimization
    • Online learning framework fusing UCB and NSW

Limitations

  1. Computational Complexity:
    • Offline algorithm requires iterative NSW optimization (NP-hard)
    • Each online round requires computing optimal policy π_t
    • Large-scale scenarios (M,A very large) may face computational bottlenecks
  2. Theoretical Assumptions:
    • Piecewise linear approximation requires sufficiently fine partitioning (assumption d/d'≥τ_i/τ_j)
    • Probing budget I must be set in advance
    • Cost function α(·) form must be known
  3. Experimental Scope:
    • Relatively limited experimental scale (maximum M=20, A=10)
    • Real data tested only on ride-sharing scenario
    • Limited comparison with more recent fair MAB methods
  4. Fairness Metric:
    • Only considers NSW, unexplored other fairness concepts (e.g., max-min, envy-freeness)
    • NSW sensitive to zero rewards (requires all agents receive positive rewards)

Future Directions

While not explicitly stated, inferrable directions include:

  1. Extension to Other Fairness Metrics: Investigate probing mechanisms combined with alternative fairness concepts
  2. Adaptive Probing Budget: Dynamically adjust probing quantity rather than fixed I
  3. Distributed Implementation: Decentralized probing and allocation decisions
  4. Non-Stationary Environments: Scenarios where reward distributions change over time
  5. Constraints: Incorporate practical constraints like budgets and delays

In-Depth Evaluation

Strengths

  1. Theoretical Rigor:
    • Strict theoretical guarantees in both offline and online settings
    • Complete and detailed proofs (appendix exceeds 20 pages)
    • Clever and essential establishment of submodularity properties
  2. Problem Importance:
    • Addresses genuine pain points in practical applications (fairness and uncertainty)
    • Ride-sharing case study has direct application value
    • Fills research gap between fair MAB and active exploration
  3. Methodological Innovation:
    • Novel combination of probing mechanism and NSW
    • Clever piecewise linear upper bound construction technique
    • Non-trivial fusion of UCB and NSW optimization
  4. Experimental Sufficiency:
    • Covers synthetic and real data
    • Multiple distribution and scale settings
    • Complete scalability analysis
    • Clear ablation studies (via Greedy P+A comparison)
  5. Writing Clarity:
    • Clear motivation articulation
    • Complete technical detail description
    • Intuitive and effective figures

Weaknesses

  1. Computational Efficiency Insufficiently Discussed:
    • No runtime reporting
    • Missing computational complexity analysis of NSW optimization
    • Large-scale scenario feasibility questionable
  2. Simplified Probing Cost Modeling:
    • α(|S|) function form too abstract
    • Insufficient detail on setting α in practical applications
    • Unexplored cost characteristic differences across applications
  3. Limited Baseline Methods:
    • Primarily compares against Jones et al. (2023)
    • Limited comparison with more recent fair MAB algorithms
    • Missing comparison with non-fair but efficient probing algorithms
  4. Limited Experimental Scale:
    • Maximum M=20, A=10 relatively small
    • Only one real dataset
    • Untested extreme imbalance scenarios (M>>A or A>>M)
  5. Theory-Practice Gap:
    • Theorem 1's approximation factor includes ζ parameter, actual selection unclear
    • Theorem 2's regret bound constant c not explicit
    • Piecewise linear approximation's actual error (0.03) impact on performance unanalyzed
  6. Insufficient Fairness Discussion:
    • NSW limitations (sensitivity to zero rewards) insufficiently discussed
    • No comparison with alternative fairness metrics
    • Individual rationality not considered

Impact

Academic Impact:

  • High: Fills important research gap, expected high citation rate
  • Provides new research paradigm for fair MAB field
  • Probing mechanism combined with fairness may inspire follow-up work

Practical Value:

  • Medium-High:
    • Direct application potential for ride-sharing and similar platforms
    • Computational complexity may limit large-scale deployment
    • Requires further engineering optimization

Reproducibility:

  • High: Code open-sourced, algorithms described in detail
  • Complete theoretical proofs easily verifiable
  • Clear experimental setup

Applicable Scenarios

Strong Applicability:

  1. Ride-sharing Dispatch: Medium-scale cities (10-20 regions, 10-50 vehicles)
  2. Content Recommendation: Platforms balancing creator exposure
  3. Wireless Network Scheduling: 60 GHz WLAN scenarios (mentioned in paper)
  4. Resource Allocation: Any scenario requiring fairness with uncertainty

Weak Applicability:

  1. Large-Scale Systems: M,A>100 may face computational infeasibility
  2. Real-Time Systems: NSW optimization computational delay may be prohibitive
  3. Dynamic Environments: Rapidly changing reward distributions
  4. Zero-Sum Competition: Direct competition among agents

Inapplicable Scenarios:

  1. Single-Agent Problems: Method overly complex
  2. Known Rewards: Probing mechanism unnecessary
  3. Extreme Fairness Requirements: NSW may be insufficiently "fair" (consider max-min)

Key References

  1. Jones, Nguyen & Nguyen (2023): "An efficient algorithm for fair multi-agent multi-armed bandit with low regret" - Foundational work directly extended by this paper
  2. Cole & Gkatzelis (2015): "Approximating the Nash social welfare with indivisible items" - Theoretical foundation for NSW optimization
  3. Zuo, Zhang & Joe-Wong (2020): "Observe Before Play: Multi-Armed Bandit with Pre-Observations" - Pioneer work on probing mechanisms
  4. Bhaskara et al. (2020): "Adaptive probing policies for shortest path routing" - Submodular probing application
  5. Caragiannis et al. (2019): "The unreasonable fairness of maximum Nash welfare" - Theoretical fairness of NSW

Summary

This paper makes important contributions to the multi-agent multi-armed bandit field by systematically combining active probing mechanisms with fairness objectives for the first time. Theoretically rigorous (constant-factor approximation and sublinear regret), experimentally comprehensive (synthetic and real data), and methodologically innovative (submodular approximation and UCB-NSW fusion). Primary limitations lie in computational complexity and experimental scale, but these do not diminish its academic value and practical potential. This work opens new research directions in fair machine learning and online decision-making, warranting further exploration and application.