2025-11-27T07:22:18.939551

Fair Algorithms with Probing for Multi-Agent Multi-Armed Bandits

Xu, Liu, Mattei et al.

We propose a multi-agent multi-armed bandit (MA-MAB) framework aimed at ensuring fair outcomes across agents while maximizing overall system performance. A key challenge in this setting is decision-making under limited information about arm rewards. To address this, we introduce a novel probing framework that strategically gathers information about selected arms before allocation. In the offline setting, where reward distributions are known, we leverage submodular properties to design a greedy probing algorithm with a provable performance bound. For the more complex online setting, we develop an algorithm that achieves sublinear regret while maintaining fairness. Extensive experiments on synthetic and real-world datasets show that our approach outperforms baseline methods, achieving better fairness and efficiency.

academic

Fair Algorithms with Probing for Multi-Agent Multi-Armed Bandits

Basic Information

Paper ID: 2506.14988
Title: Fair Algorithms with Probing for Multi-Agent Multi-Armed Bandits
Authors: Tianyi Xu, Jiaxin Liu, Nicholas Mattei, Zizhan Zheng
Institutions: Tulane University, University of Illinois Urbana-Champaign
Classification: cs.LG, cs.AI
Publication Date: arXiv preprint, version from November 26, 2025
Paper Link: https://arxiv.org/abs/2506.14988v4

Abstract

This paper proposes a multi-agent multi-armed bandit (MA-MAB) framework designed to ensure fairness among agents while maximizing overall system performance. To address decision-making challenges with limited arm reward information, a novel probing mechanism is introduced that strategically collects information about selected arms before allocation. In the offline setting (known reward distributions), a greedy probing algorithm with constant-factor approximation guarantees is designed by leveraging submodularity. In the online setting, an algorithm is developed that achieves sublinear regret while maintaining fairness. Extensive experiments on synthetic and real-world datasets demonstrate that the approach outperforms baseline methods in both fairness and efficiency.

Research Background and Motivation

Core Problem

Traditional multi-armed bandit problems typically optimize utilitarian welfare (i.e., sum of total rewards), but this leads to severe fairness issues. For example, in ride-sharing scenarios, when a central dispatch system allocates drivers to different geographic regions, optimizing total revenue may result in some drivers receiving no orders at all, causing "starvation" phenomena.

Problem Importance

Fair resource allocation is crucial in many practical applications:

Ride-sharing Platforms: Drivers need fair access to profitable regions
Content Recommendation Systems: Creator exposure should not be monopolized
Wireless Network Scheduling: Client devices require fair bandwidth allocation

Limitations of Existing Methods

Fairness Deficiency: Existing MA-MAB methods primarily focus on total reward maximization, neglecting individual fairness
Information Dependency: Relies on immediate feedback to update estimates, performing poorly in incomplete information or noisy environments
Insufficient Exploration: Lacks active information collection mechanisms, leading to error propagation from uncertain arms to allocation decisions

Research Motivation

This paper bridges the gap between fair multi-agent approaches and active exploration strategies by introducing a probing mechanism and Nash Social Welfare (NSW) objectives, enabling information acquired through active exploration to directly translate into fair outcomes for all agents.

Core Contributions

Novel Framework: Extends the MA-MAB framework by introducing a probing mechanism to test selected arms before allocation, ensuring fairness through NSW optimization
Offline Algorithm: Develops a simple yet effective greedy probing strategy with provable performance guarantees for known reward patterns
Online Algorithm: Proposes an algorithm balancing exploration and fairness, proving that probing and fair allocation do not compromise asymptotic performance (sublinear regret)
Experimental Validation: Demonstrates superiority over baselines in both fairness and efficiency on synthetic and real data

Methodology Details

Task Definition

Basic Setup:

Agents: M agents, denoted j ∈ M
Arms: A arms, denoted a ∈ A
Reward Distributions: Each agent-arm pair (j,a) has an unknown distribution D_{j,a} with mean μ_{j,a} ∈ 0,1

Decision Process (each round t):

Probing Phase: Select probing set S_t ⊆ A, obtaining fresh reward samples R_{j,a,t} for each a ∈ S_t
Allocation Phase: Based on observed rewards and current estimates, allocate arms to agents via policy π_t

Fairness Objective: Maximize Nash Social Welfare (NSW) $\text{NSW}(S_t, R_t, \mu, \pi_t) = \prod_{j \in [M]} \left( \sum_{a \in S_t} \pi_{j,a,t} R_{j,a,t} + \sum_{a \notin S_t} \pi_{j,a,t} \mu_{j,a} \right)$

Probing Cost: Effective reward is $R^{\text{total}}_t = (1 - \alpha(|S_t|)) \mathbb{E}_{R_t}[\text{NSW}(S_t, R_t, \mu, \pi_t)]$ where α(·) is a non-decreasing cost function with α(0)=0, α(I)=1

Regret Definition: $R_{\text{regret}}(T) = \sum_{t=1}^T \left[ (1-\alpha(|S^*_t|)) \mathbb{E}[\text{NSW}(S^*_t, R_t, \mu, \pi^*_t)] - R^{\text{total}}_t \right]$

Offline Setting: Greedy Probing Algorithm

Optimization Objective Decomposition

Define two utility functions:

Probing Utility g(S): Optimal NSW expectation using only probed arms $g(S) = \max_{\pi \in \Delta^A_S} \mathbb{E}\left[ \prod_{j \in [M]} \left( \sum_{a \in S} \pi_{j,a} R_{j,a} \right) \right]$
Non-Probing Utility h(S): Optimal NSW using only unprobed arms $h(S) = \max_{\pi \in \Delta^A_{[A]\backslash S}} \prod_{j \in [M]} \left( \sum_{a \notin S} \pi_{j,a} \mu_{j,a} \right)$

Piecewise Linear Upper Bound Construction

Since log(g(S)) is non-submodular, direct optimization is difficult. A piecewise linear envelope approximation is adopted:

Partition interval 0, x_max into fine-grained segments with breakpoints τ_0, τ_1, ..., τ_L
Construct tangent lines at each breakpoint: T_{τ_i}(z) = log(τ_i) + (z - τ_i)/τ_i
Define φ(z) = max_{0≤i<L} T_{τ_i}(z), satisfying φ(z) ≥ log(z)
Set f_upper(S) = φ(g(S))

Submodularity Properties

Lemmas 1-5 establish key properties:

Lemma 1: g(S) is monotone increasing
Lemma 2: f_upper(S) is monotone increasing
Lemma 3: f_upper(S) is submodular (core property)
Lemma 4: h(S) is monotone decreasing
Lemma 5: R(S) ≤ (1-α(|S|))(g(S) + h(S))

Algorithm 1: Offline Greedy Probing

Input: Distributions {F_{m,a}}, cost function α(·), budget I, parameter ζ≥1
Output: Probing set S_pr

1. Initialize S_0 = ∅
2. For i = 1 to I:
   - Select arm with maximum marginal gain:
     a = argmax_{a∉S_{i-1}} [f_upper(S_{i-1}∪{a}) - f_upper(S_{i-1})]
   - S_i = S_{i-1} ∪ {a}
3. Sort candidate sets by (1-α(|S_j|))f_upper(S_j) in descending order
4. Iterate through sorted sets:
   - If (1-α(|S_j|))f_upper(S_j) < h(∅), return ∅
   - If (1-α(|S_j|))f_upper(S_j) > ζR(S_j), continue
   - Otherwise return S_j

Theorem 1 (Approximation Guarantee): For any ζ≥1, the returned S_pr satisfies $R(S_{pr}) \geq \frac{e-1}{2e-1} \cdot \frac{1}{\zeta} \cdot R(S^*)$ This is derived from the (1-1/e) approximation factor for submodular maximization.

Online Setting: OFMUP Algorithm

Algorithm 2: Online Fair Multi-Agent UCB with Probing (OFMUP)

Initialization Phase (t = 1 to MA):

Each agent-arm pair is explored at least once
Construct empirical CDF F̂_{j,a,t} and mean estimate μ̂_{j,a,t}

Main Loop Phase (t > MA):

Probing Set Selection: Call Algorithm 1 based on current estimates F̂_{j,a,t} to select S_t
Probing Update: Sample arms in S_t, update statistics and confidence upper bounds $U_{j,a,t} = \min(\hat{\mu}_{j,a,t} + w_{j,a,t}, 1)$ where w_{j,a,t} is confidence width based on Freedman's inequality
Policy Optimization: $\pi_t = \arg\max_{\pi_t \in \Delta^A} (1-\alpha(|S_t|)) \mathbb{E}_{R_t}[\text{NSW}(S_t, R_t, U_t, \pi_t)]$
Arm Pulling: Each agent j pulls arms according to π_t, observes rewards and updates

Confidence Bound Construction

Lemma 7 (Concentration Bound): With probability at least 1-δ/2, for all t>A, a∈A, j∈M: $|\mu_{j,a} - \hat{\mu}_{j,a,t}| \leq \sqrt{\frac{2(\hat{\mu}_{j,a,t} - \hat{\mu}^2_{j,a,t}) \ln(\frac{2MAT}{\delta})}{N_{j,a,t}}} + \frac{\ln(\frac{2MAT}{\delta})}{3N_{j,a,t}} = w_{j,a,t}$

Technical Innovations

Probing and Fairness Integration: First to combine active probing mechanisms with NSW fairness objectives, unlike prior work optimizing only total rewards
Submodular Approximation Technique: Transforms non-submodular problems into submodular optimization via piecewise linear upper bounds, maintaining tractability
UCB-NSW Fusion: Online algorithm cleverly combines UCB confidence bounds with NSW optimization, balancing exploration-exploitation and fairness
Hierarchical Regret Analysis: Partitions rounds into "large γ" and "small γ" categories, handling high and low uncertainty cases separately

Experimental Setup

Datasets

Synthetic Data:

Small Scale: M=12 agents, A=8 arms
Medium Scale: M=20 agents, A=10 arms
Reward Distributions:
- Bernoulli distribution (rewards 0 or 1, means in 0.3, 0.8)
- Discrete distribution (rewards sampled from {0.3, 0.4, 0.5, 0.6, 0.7, 0.8})

Real Data: NYYellowTaxi 2016 dataset

Agents: Vehicles (randomly pre-sampled locations)
Arms: Discretized pickup locations (0.01°×0.01° grid)
Rewards: Based on normalized Manhattan distance from vehicle to pickup point (closer distance yields higher reward)

Evaluation Metrics

Cumulative Regret: Computed per formula (1), optimal reward determined via exhaustive search
Numerical Stability: Uses geometric mean to aggregate cumulative NSW of each agent's rewards
Approximation Verification: Verifies difference between f_upper(S) and log(g(S)) ≤ 0.03

Comparison Methods

Non-Probing: Fair MAB algorithm from Jones et al. (2023), without probing capability, optimizes allocation based only on current information
Random P+A: Randomly selects fixed number of arms to probe, then randomly allocates
Greedy P+A: Uses same greedy probing strategy, but randomly allocates after probing

Implementation Details

Probing Budget: Set according to problem scale
Cost Function: α(|S|) is increasing function with α(0)=0, α(I)=1
Confidence Parameter: δ set to ensure high-probability guarantees
Open Source Code: https://github.com/jiaxin26/Fair-MA-MAB-with-Probing

Experimental Results

Main Findings

Key Observations from Figure 1:

Small Scale Scenario (M=12, A=8, Bernoulli):
- OFMUP achieves 85% lower regret than Random P+A at T=3000
- 60% lower than Greedy P+A
- Significantly outperforms Non-Probing
Medium Scale Scenario (M=20, A=10, Bernoulli):
- OFMUP advantage more pronounced
- 88% lower regret than Random P+A at T=3000
- 80% lower than Greedy P+A
- Demonstrates better scalability
Discrete Distribution Scenario:
- OFMUP consistently outperforms baselines
- Gap increases as learning reward patterns
- 85% lower than Random P+A, 65% lower than Non-Probing at T=3000
Random P+A Anomaly:
- Slightly higher regret than expected in small-scale tests
- Cause: Increased variance in small samples when computing fairness via geometric mean

Scalability Analysis

Scalability Shown in Figure 2:

Fixed Arms (A=8), Varying Agents:
- OFMUP performs well across all agent counts
- Relative advantage grows with problem complexity
Fixed Agents (M=20), Varying Arms:
- OFMUP advantage more pronounced with increasing arms
- Demonstrates probing mechanism's value in high-dimensional spaces

Experimental Findings

Critical Role of Probing: Probing plays decisive role in information gathering and guiding allocation
Importance of Fair Allocation: Greedy P+A performs far worse than OFMUP, highlighting importance of fair allocation after probing
Theory Validation: Experimental results verify theoretical analysis—OFMUP effectively balances exploration-exploitation and fairness
Real Data Applicability: Success on NYYellowTaxi data demonstrates practical application potential

Multi-Armed Bandit Foundations

Single-Agent MAB: Lai & Robbins (1985), Garivier & Cappé (2011)
Multi-Agent MAB: Martínez-Rubio et al. (2019), Hossain et al. (2021)
Fairness Research: Joseph et al. (2016, 2018), Patil et al. (2021)

Theoretical Foundations: Kaneko & Nakamura (1979)
Computational Methods: Cole & Gkatzelis (2015)
MA-MAB Applications: Jones, Nguyen & Nguyen (2023)—the work directly extended by this paper

Probing Mechanisms

Economic Origins: Weitzman (1979)
Single-Agent Bandits with Probing:
- Aaron D. Tucker et al. (2023): Expensive reward observations, Θ(c^{1/3}T^{2/3}) bounds
- Elumar et al. (2024): One arm probed per round, Õ(√KT) regret
- Zuo et al. (2020): Observe-Before-Play framework
Submodular Probing: Bhaskara et al. (2020) routing problems
This Paper's Contribution: First to combine probing with multi-agent fairness

Research Gap

Existing work either addresses fair MAB relying on passive feedback or studies probing while ignoring fairness. This paper fills this gap.

Theoretical Analysis

Offline Approximation Guarantee (Theorem 1)

Result: R(S_pr) ≥ (e-1)/(2e-1) · 1/ζ · R(S*)

Proof Key Steps:

Leverage Lemma 5: R(S) ≤ (1-α(|S|))(g(S) + h(S))
Apply (1-1/e) approximation for submodular maximization
Connect f_upper to R through algorithm filtering conditions
Derive constant-factor approximation guarantee

Online Regret Bound (Theorem 2)

Result: With probability at least 1-δ, $R_{\text{regret}}(T) = O\left(\zeta(\sqrt{MAT} + MA) \ln^c\left(\frac{MAT}{\delta}\right)\right)$

Proof Framework:

NSW Smoothness (Lemma 6): Lipschitz property of NSW with respect to reward matrix
Concentration Bounds (Lemma 7): Confidence bounds based on Freedman's inequality
Round Stratification (Lemma 8):
- Define γ_{j,t} = Σ_a π_{j,a,t}(1-U_{j,a,t})
- Large γ rounds: |Q(t,p)| ≥ 2^p·3lnT, contribute ≤1/T²
- Small γ rounds: Apply concentration bounds, decompose into square-root and linear terms
Square-Root Term Bound: Handle via Young's inequality and Lemma 8
Linear Term Bound: Utilize Lemma 4.4 from Jones et al. (2023)
Final Combination: Obtain O(√MAT) main term plus lower-order terms

Key Techniques:

Convert regret from (S*,π*) to (S_t,π_t), utilizing Theorem 1's factor
UCB optimism: U_{j,a,t}≥μ_{j,a} ensures exploration
Stratified analysis avoids directly handling complex dependencies across all rounds

Conclusions and Discussion

Main Conclusions

Theoretical Contributions:
- Offline: Computable algorithm with constant-factor approximation guarantee
- Online: Sublinear regret O(√MAT), comparable to non-probing fair algorithms but superior practical performance
Practical Value:
- Probing mechanism significantly improves reward estimation
- Balances exploration-exploitation and fairness
- Validated effectiveness on real ride-sharing data
Methodological Innovation:
- First to combine active probing with multi-agent fairness
- Submodular approximation technique for non-convex optimization
- Online learning framework fusing UCB and NSW

Limitations

Computational Complexity:
- Offline algorithm requires iterative NSW optimization (NP-hard)
- Each online round requires computing optimal policy π_t
- Large-scale scenarios (M,A very large) may face computational bottlenecks
Theoretical Assumptions:
- Piecewise linear approximation requires sufficiently fine partitioning (assumption d/d'≥τ_i/τ_j)
- Probing budget I must be set in advance
- Cost function α(·) form must be known
Experimental Scope:
- Relatively limited experimental scale (maximum M=20, A=10)
- Real data tested only on ride-sharing scenario
- Limited comparison with more recent fair MAB methods
Fairness Metric:
- Only considers NSW, unexplored other fairness concepts (e.g., max-min, envy-freeness)
- NSW sensitive to zero rewards (requires all agents receive positive rewards)

Future Directions

While not explicitly stated, inferrable directions include:

Extension to Other Fairness Metrics: Investigate probing mechanisms combined with alternative fairness concepts
Adaptive Probing Budget: Dynamically adjust probing quantity rather than fixed I
Distributed Implementation: Decentralized probing and allocation decisions
Non-Stationary Environments: Scenarios where reward distributions change over time
Constraints: Incorporate practical constraints like budgets and delays

In-Depth Evaluation

Strengths

Theoretical Rigor:
- Strict theoretical guarantees in both offline and online settings
- Complete and detailed proofs (appendix exceeds 20 pages)
- Clever and essential establishment of submodularity properties
Problem Importance:
- Addresses genuine pain points in practical applications (fairness and uncertainty)
- Ride-sharing case study has direct application value
- Fills research gap between fair MAB and active exploration
Methodological Innovation:
- Novel combination of probing mechanism and NSW
- Clever piecewise linear upper bound construction technique
- Non-trivial fusion of UCB and NSW optimization
Experimental Sufficiency:
- Covers synthetic and real data
- Multiple distribution and scale settings
- Complete scalability analysis
- Clear ablation studies (via Greedy P+A comparison)
Writing Clarity:
- Clear motivation articulation
- Complete technical detail description
- Intuitive and effective figures

Weaknesses

Computational Efficiency Insufficiently Discussed:
- No runtime reporting
- Missing computational complexity analysis of NSW optimization
- Large-scale scenario feasibility questionable
Simplified Probing Cost Modeling:
- α(|S|) function form too abstract
- Insufficient detail on setting α in practical applications
- Unexplored cost characteristic differences across applications
Limited Baseline Methods:
- Primarily compares against Jones et al. (2023)
- Limited comparison with more recent fair MAB algorithms
- Missing comparison with non-fair but efficient probing algorithms
Limited Experimental Scale:
- Maximum M=20, A=10 relatively small
- Only one real dataset
- Untested extreme imbalance scenarios (M>>A or A>>M)
Theory-Practice Gap:
- Theorem 1's approximation factor includes ζ parameter, actual selection unclear
- Theorem 2's regret bound constant c not explicit
- Piecewise linear approximation's actual error (0.03) impact on performance unanalyzed
Insufficient Fairness Discussion:
- NSW limitations (sensitivity to zero rewards) insufficiently discussed
- No comparison with alternative fairness metrics
- Individual rationality not considered

Impact

Academic Impact:

High: Fills important research gap, expected high citation rate
Provides new research paradigm for fair MAB field
Probing mechanism combined with fairness may inspire follow-up work

Practical Value:

Medium-High:
- Direct application potential for ride-sharing and similar platforms
- Computational complexity may limit large-scale deployment
- Requires further engineering optimization

Reproducibility:

High: Code open-sourced, algorithms described in detail
Complete theoretical proofs easily verifiable
Clear experimental setup

Applicable Scenarios

Strong Applicability:

Ride-sharing Dispatch: Medium-scale cities (10-20 regions, 10-50 vehicles)
Content Recommendation: Platforms balancing creator exposure
Wireless Network Scheduling: 60 GHz WLAN scenarios (mentioned in paper)
Resource Allocation: Any scenario requiring fairness with uncertainty

Weak Applicability:

Large-Scale Systems: M,A>100 may face computational infeasibility
Real-Time Systems: NSW optimization computational delay may be prohibitive
Dynamic Environments: Rapidly changing reward distributions
Zero-Sum Competition: Direct competition among agents

Inapplicable Scenarios:

Single-Agent Problems: Method overly complex
Known Rewards: Probing mechanism unnecessary
Extreme Fairness Requirements: NSW may be insufficiently "fair" (consider max-min)

Key References

Jones, Nguyen & Nguyen (2023): "An efficient algorithm for fair multi-agent multi-armed bandit with low regret" - Foundational work directly extended by this paper
Cole & Gkatzelis (2015): "Approximating the Nash social welfare with indivisible items" - Theoretical foundation for NSW optimization
Zuo, Zhang & Joe-Wong (2020): "Observe Before Play: Multi-Armed Bandit with Pre-Observations" - Pioneer work on probing mechanisms
Bhaskara et al. (2020): "Adaptive probing policies for shortest path routing" - Submodular probing application
Caragiannis et al. (2019): "The unreasonable fairness of maximum Nash welfare" - Theoretical fairness of NSW

Summary

This paper makes important contributions to the multi-agent multi-armed bandit field by systematically combining active probing mechanisms with fairness objectives for the first time. Theoretically rigorous (constant-factor approximation and sublinear regret), experimentally comprehensive (synthetic and real data), and methodologically innovative (submodular approximation and UCB-NSW fusion). Primary limitations lie in computational complexity and experimental scale, but these do not diminish its academic value and practical potential. This work opens new research directions in fair machine learning and online decision-making, warranting further exploration and application.