2025-11-18T08:49:13.884110

Perturbing Best Responses in Zero-Sum Games

Dziwoki, Horcik

This paper investigates the impact of perturbations on the best-response-based algorithms approximating Nash equilibria in zero-sum games, namely Double Oracle and Fictitious Play. More precisely, we assume that the oracle computing the best responses perturbs the utilities before selecting the best response. We show that using such an oracle reduces the number of iterations for both algorithms. For some cases, suitable perturbations ensure the expected number of iterations is logarithmic. Although the utility perturbation is computationally demanding as it requires iterating through all pure strategies, we demonstrate that one can efficiently perturb the utilities in games where pure strategies have further inner structure.

academic

Perturbing Best Responses in Zero-Sum Games

Basic Information

Paper ID: 2511.12523
Title: Perturbing Best Responses in Zero-Sum Games
Authors: Adam Dziwoki, Rostislav Horčík (Czech Technical University in Prague)
Classification: cs.GT (Game Theory), cs.AI (Artificial Intelligence)
Publication Date: November 16, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2511.12523
Code Link: https://github.com/geoborek/perturbing-best-responses

Abstract

This paper investigates the effects of perturbations on best-response-based algorithms used to approximate Nash equilibria in zero-sum games, with particular focus on Double Oracle and Fictitious Play algorithms. The research assumes that the oracle computing best responses perturbs utilities before selecting the best response. The study demonstrates that using such a perturbed oracle can reduce the number of iterations required by both algorithms. In certain cases, appropriate perturbations can ensure that the expected number of iterations is logarithmic. Although utility perturbation requires traversing all pure strategies with high computational cost, the research proves that for games where pure strategies have internal structure (such as partially observable stochastic games), perturbations can be implemented efficiently.

Research Background and Motivation

Core Problem

Computing Nash equilibria in two-player zero-sum games with large strategy spaces is computationally intensive. Although it is theoretically known that ε-Nash equilibria of size O(log n) exist (where n is the number of pure strategies), traditional best-response oracle (BRO)-based algorithms such as Fictitious Play (FP) and Double Oracle (DO) require Ω(n) iterations in the worst case.

Significance

Theory-Practice Gap: Althöfer and Lipton et al. proved the existence of logarithmic-sized ε-NE, but practical algorithms cannot achieve this complexity
Lower Bound Constraints: Hazan and Koren proved that any BRO-based algorithm requires at least Ω(√n/log³n) iterations
Practical Application Needs: Deep reinforcement learning algorithms (such as Policy Space Response Oracles) rely on these foundational algorithms

Limitations of Existing Methods

FP and DO: Require O(n) iterations in the worst case
Hazan-Koren Algorithm: While providing O(√n/ε²) complexity, the algorithm is complex and offers only square-root improvement
Regularization Methods: Such as PU and OMWU achieve O(log n) iterations but require maintaining probability distributions over all pure strategies, unsuitable for large-scale games

Research Motivation

By introducing perturbations in the best-response oracle to make it more powerful, the research aims to break through the theoretical lower bound limitations and achieve logarithmic convergence speed.

Core Contributions

Introduction of Perturbed Best Response Oracle (PBRO): Proposes a mechanism that randomly perturbs utilities before computing best responses
Theoretical Results:
- Proves that Stochastic Fictitious Play (SFP) achieves O(log n/ε²) expected iteration complexity
- Proves that Stochastic Double Oracle (SDO) achieves O(log n) expected iterations for specific difficult instances (Zhang and Sandholm's example)
Efficient Implementation Scheme: Proposes efficient perturbation methods for games with internal structure (such as POSG, path planning games) without requiring traversal of all pure strategies
Experimental Validation: Verifies the effectiveness of perturbation methods on multiple game types, including matrix games, stochastic games, and path planning games
Theoretical Tools: Utilizes the Gumbel-max trick to establish connections between SFP and stochastic exponential weighted forecasting (REWF)

Method Details

Task Definition

Input: Matrix game M ∈ ℝ^(m×n), precision parameter ε > 0 Output: ε-Nash equilibrium strategy pair ⟨p*, q*⟩ Constraint: Minimize the number of iterations (oracle calls)

Mathematical Definition of Perturbed Best Response

For the row player, given column player's mixed strategy q ∈ Δ_n:

B̃R_r(q) ∈ argmin_{i∈[m]} π_i(Mq - u)

where u is a random vector with i.i.d. random variable components.

For the column player, given row player's mixed strategy p ∈ Δ_m:

B̃R_c(p) ∈ argmax_{j∈[n]} π_j(p^⊤M + v)

Stochastic Fictitious Play (SFP)

Algorithm Flow:

Initialize: t←1, p←e_k, q←e_l (initial pure strategies)
Compute termination boundary: lb←BRVal_r(q), ub←BRVal_c(p)
While ub - lb > tε:
- t←t+1
- i←B̃R_r(q), j←B̃R_c(p) (perturbed best response)
- p←p+e_i, q←q+e_j (accumulate strategies)
- Update boundaries
Return average strategy: ⟨p*/t, q*/t⟩

Key Innovations:

Uses perturbed oracle instead of deterministic oracle
Maintains cumulative strategy vectors, returns average strategy at the end
Termination condition uses unperturbed best response values

Theoretical Foundation of Gumbel Perturbation

Gumbel-max Trick (Lemma 1): For vector x ∈ ℝ^n, if z has components independently distributed as Gumbel distribution G(0,β):

P(argmax_i π_i(x+z) = i*) = π_i*(softmax(x/β))

This means that using Gumbel-perturbed best response is equivalent to sampling from a softmax distribution.

Connection to REWF:

Strategy selection by the row player in SFP is equivalent to REWF strategy
In round t, samples from softmax(-η∑^{t-1} Me)
Parameter η = 1/β controls the exploration-exploitation tradeoff

Core Theoretical Analysis

Theorem 3 (SFP Complexity): For matrix game M ∈ ℝ^(m×n) normalized to 0,1, m ≤ n, setting β = (2+√(2ln n))/(ε√(8ln n)), SFP finds ε-NE within O(log n/ε²) expected iterations.

Proof Sketch:

Set iteration count T = ((2+√(2ln n))/ε)² ∈ O(log n/ε²)

Using Corollary 2 (REWF regret bound), with probability ≥3/4:

∑_{t=1}^T M(i_t,j_t) - min_i ∑_{t=1}^T M(i,j_t) ≤ √T(1 + √(ln n)/2)

Apply dual result for column player (probability ≥3/4)
Both events occur simultaneously with probability ≥1/2
Combining both inequalities yields ub - lb ≤ ε
Expected running time ≤2

Stochastic Double Oracle (SDO)

Algorithm Characteristics:

Maintains strategy subsets R ⊆ m, C ⊆ n
Each round computes Nash equilibrium of subgame MR,C
Uses perturbed oracle to add new strategies

Analysis for Specific Games:

Example 1 (Matrix L):

L = [0   1   ...  1  ]
    [-1  0   ...  1  ]
    [⋮   ⋮   ⋱   ⋮  ]
    [-1  -1  ... 0  ]

Lemma 4 (Expected Value of Uniform Perturbation): For the k-th row x = ⟨1,...,1,0,-1,...,-1⟩, using U(-1/2,1/2) perturbation, the expected index of perturbed best response EI = k/2

Theorem 6: SDO finds NE of L within O(log n) expected iterations

Proof Technique:

Define potential function X_t = max{r_t, c_t}, where r_t = min R_t, c_t = min C_t
Use drift theory to prove X_t - EX_{t+2} ≥ X_t/2
Apply Corollary 17 to obtain ET ≤ 2ln n

Efficient Perturbation Implementation

Clustering Method: For games with internal structure (such as stochastic games):

Identify matrix elements corresponding to the same terminal state
Cluster matrix elements into K clusters
Apply the same random perturbation value to each cluster

Perturbation Formula:

B̃R_r(q) = argmin_{i∈[m]} π_i((L + ∑_{k=1}^K z_k B_k)q)

where B_k is a mask matrix identifying elements in the k-th cluster

Advantages:

Only K random numbers need to be generated (K << mn)
Preserves semantic structure of the game
Applicable to structured games like POSG and EFG

Experimental Setup

Datasets and Game Types

Random Matrix Games: n×n matrices with elements uniformly sampled from 0,1, 100 instances per size
Matrix L and U^⊤: Difficult instances from Examples 1 and 2
f-finger Morra Game: Classic game with matrix size f²×f²
Colonel Blotto Game: 5 battlefields with different unit budgets
n-bit Random Stochastic Games: Corresponding to 2^n×2^n matrices with tree structure
Path Planning Games: n×n grid where one player seeks shortest path and the other chooses edges to increase cost

Evaluation Metrics

Iteration Count: Number of oracle calls needed to reach ε-NE
ε Value: Set to 0.1
Statistics: Mean and standard deviation over 10 repeated experiments

Comparison Methods

FP: Standard Fictitious Play
AFP: Anticipatory Fictitious Play
SAFP: Perturbed version of AFP
DO: Standard Double Oracle
SDO: Perturbed version of Double Oracle

Implementation Details

Programming Language: Python
Hardware: AMD Ryzen 7 PRO 7840U
Random Number Generation: numpy library, initial seed 1
Initialization: Worst-case index (k=l=n)
Tie-breaking: Select best response with minimum index
SFP β Parameter: Set according to Theorem 3
SDO Perturbation Distribution:
- Theoretical analysis: U(-1,1)
- Practical application: U(-0.01,0.01) and U(-0.001,0.001)

Experimental Results

Main Results

SFP Performance on Different Games

Random 0,1 Matrix Games:

AFP performs best, significantly outperforming other methods
SFP and SAFP show similar performance, slightly better than FP
Perturbation advantage not obvious in random games
Iteration count grows nearly linearly with scale

Matrix U^⊤:

FP and AFP show O(n) worst-case complexity
SFP and SAFP significantly reduce iteration count
For n=1000, SFP requires approximately 10-15 iterations, while FP requires 1000
Validates theoretical O(log n) complexity

f-finger Morra Game:

SFP and SAFP converge rapidly, even for large f
FP and AFP iteration count grows with f²
For f=10, SFP requires approximately 50 iterations, FP requires 200+

SDO Performance on Different Games

Matrix L and U^⊤ (Theoretical Perturbation U(-1,1)):

SDO achieves theoretically predicted logarithmic complexity
For n=2000, SDO requires approximately 10-15 iterations
DO requires n iterations (not shown in figures due to scale)
Perfectly validates Theorems 6 and 7

f-finger Morra Game:

Perturbation significantly reduces iteration count
Both U(-0.01,0.01) and U(-0.001,0.001) are effective
Smaller perturbation (U(-0.001,0.001)) shows more stable performance at large scales

Colonel Blotto Game:

5 battlefields, unit budgets 0-40
Perturbation methods consistently outperform non-perturbed versions
U(-0.01,0.01) shows best overall performance

Efficient Perturbation Experiments

n-bit Random Stochastic Games (Corresponding to L and U^⊤):

Uses clustering perturbation method
For n=2000 (2^11 scale), requires approximately 100 iterations
Although slightly slower than theoretical perturbation, far superior to DO's linear complexity
Demonstrates feasibility of efficient implementation

Path Planning Games:

Tested on n×n grids
Perturbation significantly reduces iteration count
Advantage becomes more pronounced as grid size increases
For n=14, perturbed version requires approximately 100 iterations, non-perturbed requires 200+

Ablation Study Observations

Perturbation Strength Impact:

Excessive perturbation may harm convergence (observed in random games)
U(-0.001,0.001) shows stable performance in most cases
U(-0.01,0.01) more effective in structured games

Perturbation Distribution Choice:

Gumbel distribution: Theoretically optimal, suitable for SFP
Uniform distribution: Simpler in practice, effective for SDO
Both show comparable performance in experiments

Key Findings

Game Type Dependency: Perturbation is highly effective in structured/difficult instances, with less obvious advantages in random games
Theory-Practice Consistency: Experimental results highly consistent with theoretical predictions
Efficient Implementation Feasibility: Clustering method maintains effectiveness while significantly reducing computational cost
Robustness: Perturbation method shows stable performance across multiple game types

Fictitious Play Convergence Research

Robinson (1951): Proves FP converges to NE in zero-sum games
Karlin Conjecture: Oldest open problem regarding FP convergence speed
Partial Results: Results by Abernethy et al. (2021), Daskalakis and Pan (2014) for specific matrix types
AFP: Proposed by Cloud et al. (2023), still requires O(n) iterations but with smaller constants, calls BRO 4 times per round

Regularization Methods

Hofbauer and Sandholm (2002): Establishes connection between perturbation and regularization
PU and OMWU: Regularized AFP variants by Cen et al. (2024), achieve O(log n) but require maintaining probability distributions over all strategies
This Paper's Distinction: PBRO only maintains subset of selected strategies, suitable for large-scale games

Double Oracle Research

Baseline Complexity: DO known to require Θ(n) iterations, but lacks deep theoretical investigation
Zhang and Sandholm (2024): Prove exponential lower bound for DO on POSG
Variants: Self-Play PSRO by McAleer et al. (2022) and others, but lack convergence guarantees
This Paper's Contribution: First systematic study of SDO convergence properties

Theoretical Limits of BRO Algorithms

Althöfer (1994), Lipton and Young (1994): Existence of O(log n)-sized ε-NE
Hazan and Koren (2016):
- Lower bound: Any BRO algorithm requires Ω(√n/log³n) iterations
- Upper bound: Propose O(√n/ε²) algorithm
This Paper's Breakthrough: Breaks theoretical lower bound by enhancing BRO (adding perturbation)

Deep Reinforcement Learning Applications

PSRO Series: Lanctot et al. (2017), McAleer et al. (2022), Bighashdel et al. (2024)
Connection: These methods rely on DO framework, this paper's SDO directly applicable
Potential Impact: Perturbation mechanism may improve exploration strategies in deep RL

Conclusions and Discussion

Main Conclusions

Theoretical Breakthrough:
- SFP achieves O(log n/ε²) expected complexity, first proof of logarithmic convergence for PBRO algorithms
- SDO achieves O(log n) expected complexity on specific difficult instances
Practical Value:
- Perturbation method significantly reduces iterations in structured games
- Efficient implementation scheme makes method applicable to large-scale games
Methodological Contributions:
- Establishes theoretical connection between SFP and REWF
- Provides framework using drift theory for analyzing SDO

Limitations

SDO General Complexity Unresolved:
- Only proved logarithmic complexity for specific instances
- General case complexity remains open problem
Random Game Performance:
- Perturbation advantage not obvious in randomly generated games
- Possibly because random games inherently have stochastic best responses
Perturbation Parameter Selection:
- Theoretically optimal parameters (β) may be too large in practice
- Requires adjustment based on specific game
Approximate Nature of Efficient Implementation:
- Clustering perturbation not completely equivalent to full perturbation
- Theoretical guarantees only apply to full perturbation
Computational Cost:
- While reducing iterations, each iteration requires random number generation
- For some games, benefits may not offset additional overhead

Future Directions

General SDO Complexity Analysis:
- Prove or disprove logarithmic complexity for SDO in general case
- Identify game characteristics where SDO is efficient
Adaptive Perturbation Strategies:
- Dynamically adjust perturbation strength based on convergence
- Explore theoretical properties of different perturbation distributions
Deep Reinforcement Learning Integration:
- Integrate PBRO into PSRO framework
- Study perturbation effects when BRO is approximated by neural networks
Broader Game Categories:
- Extend to general-sum games
- Investigate perturbation mechanisms in multi-player games
Practical Application Validation:
- Test in real game scenarios (poker, strategy games)
- Evaluate performance under actual computational resource constraints

In-Depth Evaluation

Strengths

Theoretical Rigor:
- Complete mathematical proofs from Gumbel-max trick to drift theory
- Clear connection established between SFP and classical online learning algorithm (REWF)
- Theoretical results highly consistent with experiments
Important Problem Selection:
- Addresses long-standing theory-practice gap in the field
- Breaks Hazan-Koren lower bound by enhancing oracle
- Direct application value for deep reinforcement learning
Method Innovation:
- Simple yet effective perturbation mechanism
- Efficient implementation scheme solves computational bottleneck
- Unified framework applicable to both FP and DO algorithms
Comprehensive Experiments:
- Covers theoretical instances, classic games, random games, structured games
- Compares multiple baseline methods
- Honestly reports negative results in random games
Writing Clarity:
- Sufficient background introduction, clear motivation
- Complete technical details, strong reproducibility
- Provides open-source code

Weaknesses

Theoretical Completeness:
- SDO general complexity unresolved, only specific case analysis
- Lacks theoretical explanation for perturbation failure in random games
- Quantitative gap between efficient and full perturbation not characterized
Experimental Design:
- Relatively small scale for random game experiments (n≤1000)
- Missing direct comparison with Hazan-Koren algorithm
- Reports only iteration count, not actual runtime
Practical Considerations:
- Lacks universal guidance for perturbation parameter selection
- No clear criteria for when to use perturbation
- Applicability scope of efficient implementation not sufficiently clear
Analysis Depth:
- Insufficient intuitive explanation for why perturbation works
- Limited analysis of relationship between game structure and perturbation effectiveness
- Relatively simple ablation studies
Comparison Fairness:
- AFP calls BRO 4 times per round, while FP/SFP call only 2 times
- Should also report comparison in total BRO calls

Impact

Theoretical Contribution:
- Provides new direction for BRO algorithm research
- Demonstrates potential of enhanced oracles
- May inspire research in other combinatorial optimization problems
Practical Value:
- Directly applicable to existing DO/FP frameworks
- Potential to improve PSRO methods in deep RL
- Efficient implementation scheme practically operable
Limitations:
- Requires further theoretical development to become standard method
- Poor performance in random games limits universality
- Computational overhead needs verification in large-scale applications
Reproducibility:
- Provides open-source code, strong reproducibility
- Clear algorithm description, easy to implement
- Detailed experimental setup

Applicable Scenarios

Strongly Recommended:

Games with clear structure (EFG, POSG)
Games with multiple equivalent best responses
Difficult instances where traditional DO/FP converges slowly
Games with huge strategy space but internal structure

Use with Caution:

Completely random games
Games with unique, clear best responses
Scenarios with extremely limited computational resources
Applications requiring deterministic guarantees

Not Recommended:

Small-scale games (direct solution faster)
Unstructured general games (perturbation advantage unclear)
Real-time decision scenarios (randomness may be unacceptable)

References

Key Theoretical Foundations:

Althöfer (1994) - Existence of logarithmic-sized ε-NE
Hazan & Koren (2016) - Theoretical bounds for BRO algorithms
Hofbauer & Sandholm (2002) - SFP convergence proof
Cesa-Bianchi & Lugosi (2006) - REWF algorithm

Related Work: 5. Zhang & Sandholm (2024) - Exponential lower bound for DO 6. Cloud et al. (2023) - Anticipatory Fictitious Play 7. Lanctot et al. (2017) - Policy Space Response Oracles 8. Cen et al. (2024) - Regularized game algorithms

Overall Assessment: This is an excellent paper combining theory and practice well. The main contribution is proving that perturbation mechanisms enable BRO algorithms to achieve logarithmic complexity, breaking known theoretical lower bounds (by enhancing the oracle). The theoretical results for SFP are complete and rigorous, while SDO analysis, though incomplete, provides valuable insights. The experimental design is comprehensive and honestly reports the method's applicable scope. Main limitations are unresolved general SDO complexity and lack of theoretical explanation for poor performance in random games. This work opens new directions for game theory algorithm research and has potential application value for deep reinforcement learning, warranting further investigation.