2025-11-18T08:49:13.884110

Perturbing Best Responses in Zero-Sum Games

Dziwoki, Horcik
This paper investigates the impact of perturbations on the best-response-based algorithms approximating Nash equilibria in zero-sum games, namely Double Oracle and Fictitious Play. More precisely, we assume that the oracle computing the best responses perturbs the utilities before selecting the best response. We show that using such an oracle reduces the number of iterations for both algorithms. For some cases, suitable perturbations ensure the expected number of iterations is logarithmic. Although the utility perturbation is computationally demanding as it requires iterating through all pure strategies, we demonstrate that one can efficiently perturb the utilities in games where pure strategies have further inner structure.
academic

Perturbing Best Responses in Zero-Sum Games

Basic Information

Abstract

This paper investigates the effects of perturbations on best-response-based algorithms used to approximate Nash equilibria in zero-sum games, with particular focus on Double Oracle and Fictitious Play algorithms. The research assumes that the oracle computing best responses perturbs utilities before selecting the best response. The study demonstrates that using such a perturbed oracle can reduce the number of iterations required by both algorithms. In certain cases, appropriate perturbations can ensure that the expected number of iterations is logarithmic. Although utility perturbation requires traversing all pure strategies with high computational cost, the research proves that for games where pure strategies have internal structure (such as partially observable stochastic games), perturbations can be implemented efficiently.

Research Background and Motivation

Core Problem

Computing Nash equilibria in two-player zero-sum games with large strategy spaces is computationally intensive. Although it is theoretically known that ε-Nash equilibria of size O(log n) exist (where n is the number of pure strategies), traditional best-response oracle (BRO)-based algorithms such as Fictitious Play (FP) and Double Oracle (DO) require Ω(n) iterations in the worst case.

Significance

  1. Theory-Practice Gap: Althöfer and Lipton et al. proved the existence of logarithmic-sized ε-NE, but practical algorithms cannot achieve this complexity
  2. Lower Bound Constraints: Hazan and Koren proved that any BRO-based algorithm requires at least Ω(√n/log³n) iterations
  3. Practical Application Needs: Deep reinforcement learning algorithms (such as Policy Space Response Oracles) rely on these foundational algorithms

Limitations of Existing Methods

  1. FP and DO: Require O(n) iterations in the worst case
  2. Hazan-Koren Algorithm: While providing O(√n/ε²) complexity, the algorithm is complex and offers only square-root improvement
  3. Regularization Methods: Such as PU and OMWU achieve O(log n) iterations but require maintaining probability distributions over all pure strategies, unsuitable for large-scale games

Research Motivation

By introducing perturbations in the best-response oracle to make it more powerful, the research aims to break through the theoretical lower bound limitations and achieve logarithmic convergence speed.

Core Contributions

  1. Introduction of Perturbed Best Response Oracle (PBRO): Proposes a mechanism that randomly perturbs utilities before computing best responses
  2. Theoretical Results:
    • Proves that Stochastic Fictitious Play (SFP) achieves O(log n/ε²) expected iteration complexity
    • Proves that Stochastic Double Oracle (SDO) achieves O(log n) expected iterations for specific difficult instances (Zhang and Sandholm's example)
  3. Efficient Implementation Scheme: Proposes efficient perturbation methods for games with internal structure (such as POSG, path planning games) without requiring traversal of all pure strategies
  4. Experimental Validation: Verifies the effectiveness of perturbation methods on multiple game types, including matrix games, stochastic games, and path planning games
  5. Theoretical Tools: Utilizes the Gumbel-max trick to establish connections between SFP and stochastic exponential weighted forecasting (REWF)

Method Details

Task Definition

Input: Matrix game M ∈ ℝ^(m×n), precision parameter ε > 0 Output: ε-Nash equilibrium strategy pair ⟨p*, q*⟩ Constraint: Minimize the number of iterations (oracle calls)

Mathematical Definition of Perturbed Best Response

For the row player, given column player's mixed strategy q ∈ Δ_n:

B̃R_r(q) ∈ argmin_{i∈[m]} π_i(Mq - u)

where u is a random vector with i.i.d. random variable components.

For the column player, given row player's mixed strategy p ∈ Δ_m:

B̃R_c(p) ∈ argmax_{j∈[n]} π_j(p^⊤M + v)

Stochastic Fictitious Play (SFP)

Algorithm Flow:

  1. Initialize: t←1, p←e_k, q←e_l (initial pure strategies)
  2. Compute termination boundary: lb←BRVal_r(q), ub←BRVal_c(p)
  3. While ub - lb > tε:
    • t←t+1
    • i←B̃R_r(q), j←B̃R_c(p) (perturbed best response)
    • p←p+e_i, q←q+e_j (accumulate strategies)
    • Update boundaries
  4. Return average strategy: ⟨p*/t, q*/t⟩

Key Innovations:

  • Uses perturbed oracle instead of deterministic oracle
  • Maintains cumulative strategy vectors, returns average strategy at the end
  • Termination condition uses unperturbed best response values

Theoretical Foundation of Gumbel Perturbation

Gumbel-max Trick (Lemma 1): For vector x ∈ ℝ^n, if z has components independently distributed as Gumbel distribution G(0,β):

P(argmax_i π_i(x+z) = i*) = π_i*(softmax(x/β))

This means that using Gumbel-perturbed best response is equivalent to sampling from a softmax distribution.

Connection to REWF:

  • Strategy selection by the row player in SFP is equivalent to REWF strategy
  • In round t, samples from softmax(-η∑^{t-1} Me)
  • Parameter η = 1/β controls the exploration-exploitation tradeoff

Core Theoretical Analysis

Theorem 3 (SFP Complexity): For matrix game M ∈ ℝ^(m×n) normalized to 0,1, m ≤ n, setting β = (2+√(2ln n))/(ε√(8ln n)), SFP finds ε-NE within O(log n/ε²) expected iterations.

Proof Sketch:

  1. Set iteration count T = ((2+√(2ln n))/ε)² ∈ O(log n/ε²)
  2. Using Corollary 2 (REWF regret bound), with probability ≥3/4:
    ∑_{t=1}^T M(i_t,j_t) - min_i ∑_{t=1}^T M(i,j_t) ≤ √T(1 + √(ln n)/2)
    
  3. Apply dual result for column player (probability ≥3/4)
  4. Both events occur simultaneously with probability ≥1/2
  5. Combining both inequalities yields ub - lb ≤ ε
  6. Expected running time ≤2

Stochastic Double Oracle (SDO)

Algorithm Characteristics:

  • Maintains strategy subsets R ⊆ m, C ⊆ n
  • Each round computes Nash equilibrium of subgame MR,C
  • Uses perturbed oracle to add new strategies

Analysis for Specific Games:

Example 1 (Matrix L):

L = [0   1   ...  1  ]
    [-1  0   ...  1  ]
    [⋮   ⋮   ⋱   ⋮  ]
    [-1  -1  ... 0  ]

Lemma 4 (Expected Value of Uniform Perturbation): For the k-th row x = ⟨1,...,1,0,-1,...,-1⟩, using U(-1/2,1/2) perturbation, the expected index of perturbed best response EI = k/2

Theorem 6: SDO finds NE of L within O(log n) expected iterations

Proof Technique:

  • Define potential function X_t = max{r_t, c_t}, where r_t = min R_t, c_t = min C_t
  • Use drift theory to prove X_t - EX_{t+2} ≥ X_t/2
  • Apply Corollary 17 to obtain ET ≤ 2ln n

Efficient Perturbation Implementation

Clustering Method: For games with internal structure (such as stochastic games):

  1. Identify matrix elements corresponding to the same terminal state
  2. Cluster matrix elements into K clusters
  3. Apply the same random perturbation value to each cluster

Perturbation Formula:

B̃R_r(q) = argmin_{i∈[m]} π_i((L + ∑_{k=1}^K z_k B_k)q)

where B_k is a mask matrix identifying elements in the k-th cluster

Advantages:

  • Only K random numbers need to be generated (K << mn)
  • Preserves semantic structure of the game
  • Applicable to structured games like POSG and EFG

Experimental Setup

Datasets and Game Types

  1. Random Matrix Games: n×n matrices with elements uniformly sampled from 0,1, 100 instances per size
  2. Matrix L and U^⊤: Difficult instances from Examples 1 and 2
  3. f-finger Morra Game: Classic game with matrix size f²×f²
  4. Colonel Blotto Game: 5 battlefields with different unit budgets
  5. n-bit Random Stochastic Games: Corresponding to 2^n×2^n matrices with tree structure
  6. Path Planning Games: n×n grid where one player seeks shortest path and the other chooses edges to increase cost

Evaluation Metrics

  • Iteration Count: Number of oracle calls needed to reach ε-NE
  • ε Value: Set to 0.1
  • Statistics: Mean and standard deviation over 10 repeated experiments

Comparison Methods

  1. FP: Standard Fictitious Play
  2. AFP: Anticipatory Fictitious Play
  3. SAFP: Perturbed version of AFP
  4. DO: Standard Double Oracle
  5. SDO: Perturbed version of Double Oracle

Implementation Details

  • Programming Language: Python
  • Hardware: AMD Ryzen 7 PRO 7840U
  • Random Number Generation: numpy library, initial seed 1
  • Initialization: Worst-case index (k=l=n)
  • Tie-breaking: Select best response with minimum index
  • SFP β Parameter: Set according to Theorem 3
  • SDO Perturbation Distribution:
    • Theoretical analysis: U(-1,1)
    • Practical application: U(-0.01,0.01) and U(-0.001,0.001)

Experimental Results

Main Results

SFP Performance on Different Games

Random 0,1 Matrix Games:

  • AFP performs best, significantly outperforming other methods
  • SFP and SAFP show similar performance, slightly better than FP
  • Perturbation advantage not obvious in random games
  • Iteration count grows nearly linearly with scale

Matrix U^⊤:

  • FP and AFP show O(n) worst-case complexity
  • SFP and SAFP significantly reduce iteration count
  • For n=1000, SFP requires approximately 10-15 iterations, while FP requires 1000
  • Validates theoretical O(log n) complexity

f-finger Morra Game:

  • SFP and SAFP converge rapidly, even for large f
  • FP and AFP iteration count grows with f²
  • For f=10, SFP requires approximately 50 iterations, FP requires 200+

SDO Performance on Different Games

Matrix L and U^⊤ (Theoretical Perturbation U(-1,1)):

  • SDO achieves theoretically predicted logarithmic complexity
  • For n=2000, SDO requires approximately 10-15 iterations
  • DO requires n iterations (not shown in figures due to scale)
  • Perfectly validates Theorems 6 and 7

f-finger Morra Game:

  • Perturbation significantly reduces iteration count
  • Both U(-0.01,0.01) and U(-0.001,0.001) are effective
  • Smaller perturbation (U(-0.001,0.001)) shows more stable performance at large scales

Colonel Blotto Game:

  • 5 battlefields, unit budgets 0-40
  • Perturbation methods consistently outperform non-perturbed versions
  • U(-0.01,0.01) shows best overall performance

Efficient Perturbation Experiments

n-bit Random Stochastic Games (Corresponding to L and U^⊤):

  • Uses clustering perturbation method
  • For n=2000 (2^11 scale), requires approximately 100 iterations
  • Although slightly slower than theoretical perturbation, far superior to DO's linear complexity
  • Demonstrates feasibility of efficient implementation

Path Planning Games:

  • Tested on n×n grids
  • Perturbation significantly reduces iteration count
  • Advantage becomes more pronounced as grid size increases
  • For n=14, perturbed version requires approximately 100 iterations, non-perturbed requires 200+

Ablation Study Observations

Perturbation Strength Impact:

  • Excessive perturbation may harm convergence (observed in random games)
  • U(-0.001,0.001) shows stable performance in most cases
  • U(-0.01,0.01) more effective in structured games

Perturbation Distribution Choice:

  • Gumbel distribution: Theoretically optimal, suitable for SFP
  • Uniform distribution: Simpler in practice, effective for SDO
  • Both show comparable performance in experiments

Key Findings

  1. Game Type Dependency: Perturbation is highly effective in structured/difficult instances, with less obvious advantages in random games
  2. Theory-Practice Consistency: Experimental results highly consistent with theoretical predictions
  3. Efficient Implementation Feasibility: Clustering method maintains effectiveness while significantly reducing computational cost
  4. Robustness: Perturbation method shows stable performance across multiple game types

Fictitious Play Convergence Research

  • Robinson (1951): Proves FP converges to NE in zero-sum games
  • Karlin Conjecture: Oldest open problem regarding FP convergence speed
  • Partial Results: Results by Abernethy et al. (2021), Daskalakis and Pan (2014) for specific matrix types
  • AFP: Proposed by Cloud et al. (2023), still requires O(n) iterations but with smaller constants, calls BRO 4 times per round

Regularization Methods

  • Hofbauer and Sandholm (2002): Establishes connection between perturbation and regularization
  • PU and OMWU: Regularized AFP variants by Cen et al. (2024), achieve O(log n) but require maintaining probability distributions over all strategies
  • This Paper's Distinction: PBRO only maintains subset of selected strategies, suitable for large-scale games

Double Oracle Research

  • Baseline Complexity: DO known to require Θ(n) iterations, but lacks deep theoretical investigation
  • Zhang and Sandholm (2024): Prove exponential lower bound for DO on POSG
  • Variants: Self-Play PSRO by McAleer et al. (2022) and others, but lack convergence guarantees
  • This Paper's Contribution: First systematic study of SDO convergence properties

Theoretical Limits of BRO Algorithms

  • Althöfer (1994), Lipton and Young (1994): Existence of O(log n)-sized ε-NE
  • Hazan and Koren (2016):
    • Lower bound: Any BRO algorithm requires Ω(√n/log³n) iterations
    • Upper bound: Propose O(√n/ε²) algorithm
  • This Paper's Breakthrough: Breaks theoretical lower bound by enhancing BRO (adding perturbation)

Deep Reinforcement Learning Applications

  • PSRO Series: Lanctot et al. (2017), McAleer et al. (2022), Bighashdel et al. (2024)
  • Connection: These methods rely on DO framework, this paper's SDO directly applicable
  • Potential Impact: Perturbation mechanism may improve exploration strategies in deep RL

Conclusions and Discussion

Main Conclusions

  1. Theoretical Breakthrough:
    • SFP achieves O(log n/ε²) expected complexity, first proof of logarithmic convergence for PBRO algorithms
    • SDO achieves O(log n) expected complexity on specific difficult instances
  2. Practical Value:
    • Perturbation method significantly reduces iterations in structured games
    • Efficient implementation scheme makes method applicable to large-scale games
  3. Methodological Contributions:
    • Establishes theoretical connection between SFP and REWF
    • Provides framework using drift theory for analyzing SDO

Limitations

  1. SDO General Complexity Unresolved:
    • Only proved logarithmic complexity for specific instances
    • General case complexity remains open problem
  2. Random Game Performance:
    • Perturbation advantage not obvious in randomly generated games
    • Possibly because random games inherently have stochastic best responses
  3. Perturbation Parameter Selection:
    • Theoretically optimal parameters (β) may be too large in practice
    • Requires adjustment based on specific game
  4. Approximate Nature of Efficient Implementation:
    • Clustering perturbation not completely equivalent to full perturbation
    • Theoretical guarantees only apply to full perturbation
  5. Computational Cost:
    • While reducing iterations, each iteration requires random number generation
    • For some games, benefits may not offset additional overhead

Future Directions

  1. General SDO Complexity Analysis:
    • Prove or disprove logarithmic complexity for SDO in general case
    • Identify game characteristics where SDO is efficient
  2. Adaptive Perturbation Strategies:
    • Dynamically adjust perturbation strength based on convergence
    • Explore theoretical properties of different perturbation distributions
  3. Deep Reinforcement Learning Integration:
    • Integrate PBRO into PSRO framework
    • Study perturbation effects when BRO is approximated by neural networks
  4. Broader Game Categories:
    • Extend to general-sum games
    • Investigate perturbation mechanisms in multi-player games
  5. Practical Application Validation:
    • Test in real game scenarios (poker, strategy games)
    • Evaluate performance under actual computational resource constraints

In-Depth Evaluation

Strengths

  1. Theoretical Rigor:
    • Complete mathematical proofs from Gumbel-max trick to drift theory
    • Clear connection established between SFP and classical online learning algorithm (REWF)
    • Theoretical results highly consistent with experiments
  2. Important Problem Selection:
    • Addresses long-standing theory-practice gap in the field
    • Breaks Hazan-Koren lower bound by enhancing oracle
    • Direct application value for deep reinforcement learning
  3. Method Innovation:
    • Simple yet effective perturbation mechanism
    • Efficient implementation scheme solves computational bottleneck
    • Unified framework applicable to both FP and DO algorithms
  4. Comprehensive Experiments:
    • Covers theoretical instances, classic games, random games, structured games
    • Compares multiple baseline methods
    • Honestly reports negative results in random games
  5. Writing Clarity:
    • Sufficient background introduction, clear motivation
    • Complete technical details, strong reproducibility
    • Provides open-source code

Weaknesses

  1. Theoretical Completeness:
    • SDO general complexity unresolved, only specific case analysis
    • Lacks theoretical explanation for perturbation failure in random games
    • Quantitative gap between efficient and full perturbation not characterized
  2. Experimental Design:
    • Relatively small scale for random game experiments (n≤1000)
    • Missing direct comparison with Hazan-Koren algorithm
    • Reports only iteration count, not actual runtime
  3. Practical Considerations:
    • Lacks universal guidance for perturbation parameter selection
    • No clear criteria for when to use perturbation
    • Applicability scope of efficient implementation not sufficiently clear
  4. Analysis Depth:
    • Insufficient intuitive explanation for why perturbation works
    • Limited analysis of relationship between game structure and perturbation effectiveness
    • Relatively simple ablation studies
  5. Comparison Fairness:
    • AFP calls BRO 4 times per round, while FP/SFP call only 2 times
    • Should also report comparison in total BRO calls

Impact

  1. Theoretical Contribution:
    • Provides new direction for BRO algorithm research
    • Demonstrates potential of enhanced oracles
    • May inspire research in other combinatorial optimization problems
  2. Practical Value:
    • Directly applicable to existing DO/FP frameworks
    • Potential to improve PSRO methods in deep RL
    • Efficient implementation scheme practically operable
  3. Limitations:
    • Requires further theoretical development to become standard method
    • Poor performance in random games limits universality
    • Computational overhead needs verification in large-scale applications
  4. Reproducibility:
    • Provides open-source code, strong reproducibility
    • Clear algorithm description, easy to implement
    • Detailed experimental setup

Applicable Scenarios

Strongly Recommended:

  1. Games with clear structure (EFG, POSG)
  2. Games with multiple equivalent best responses
  3. Difficult instances where traditional DO/FP converges slowly
  4. Games with huge strategy space but internal structure

Use with Caution:

  1. Completely random games
  2. Games with unique, clear best responses
  3. Scenarios with extremely limited computational resources
  4. Applications requiring deterministic guarantees

Not Recommended:

  1. Small-scale games (direct solution faster)
  2. Unstructured general games (perturbation advantage unclear)
  3. Real-time decision scenarios (randomness may be unacceptable)

References

Key Theoretical Foundations:

  1. Althöfer (1994) - Existence of logarithmic-sized ε-NE
  2. Hazan & Koren (2016) - Theoretical bounds for BRO algorithms
  3. Hofbauer & Sandholm (2002) - SFP convergence proof
  4. Cesa-Bianchi & Lugosi (2006) - REWF algorithm

Related Work: 5. Zhang & Sandholm (2024) - Exponential lower bound for DO 6. Cloud et al. (2023) - Anticipatory Fictitious Play 7. Lanctot et al. (2017) - Policy Space Response Oracles 8. Cen et al. (2024) - Regularized game algorithms


Overall Assessment: This is an excellent paper combining theory and practice well. The main contribution is proving that perturbation mechanisms enable BRO algorithms to achieve logarithmic complexity, breaking known theoretical lower bounds (by enhancing the oracle). The theoretical results for SFP are complete and rigorous, while SDO analysis, though incomplete, provides valuable insights. The experimental design is comprehensive and honestly reports the method's applicable scope. Main limitations are unresolved general SDO complexity and lack of theoretical explanation for poor performance in random games. This work opens new directions for game theory algorithm research and has potential application value for deep reinforcement learning, warranting further investigation.