This paper investigates the impact of perturbations on the best-response-based algorithms approximating Nash equilibria in zero-sum games, namely Double Oracle and Fictitious Play. More precisely, we assume that the oracle computing the best responses perturbs the utilities before selecting the best response. We show that using such an oracle reduces the number of iterations for both algorithms. For some cases, suitable perturbations ensure the expected number of iterations is logarithmic. Although the utility perturbation is computationally demanding as it requires iterating through all pure strategies, we demonstrate that one can efficiently perturb the utilities in games where pure strategies have further inner structure.
This paper investigates the effects of perturbations on best-response-based algorithms used to approximate Nash equilibria in zero-sum games, with particular focus on Double Oracle and Fictitious Play algorithms. The research assumes that the oracle computing best responses perturbs utilities before selecting the best response. The study demonstrates that using such a perturbed oracle can reduce the number of iterations required by both algorithms. In certain cases, appropriate perturbations can ensure that the expected number of iterations is logarithmic. Although utility perturbation requires traversing all pure strategies with high computational cost, the research proves that for games where pure strategies have internal structure (such as partially observable stochastic games), perturbations can be implemented efficiently.
Computing Nash equilibria in two-player zero-sum games with large strategy spaces is computationally intensive. Although it is theoretically known that ε-Nash equilibria of size O(log n) exist (where n is the number of pure strategies), traditional best-response oracle (BRO)-based algorithms such as Fictitious Play (FP) and Double Oracle (DO) require Ω(n) iterations in the worst case.
Theory-Practice Gap: Althöfer and Lipton et al. proved the existence of logarithmic-sized ε-NE, but practical algorithms cannot achieve this complexity
Lower Bound Constraints: Hazan and Koren proved that any BRO-based algorithm requires at least Ω(√n/log³n) iterations
Practical Application Needs: Deep reinforcement learning algorithms (such as Policy Space Response Oracles) rely on these foundational algorithms
FP and DO: Require O(n) iterations in the worst case
Hazan-Koren Algorithm: While providing O(√n/ε²) complexity, the algorithm is complex and offers only square-root improvement
Regularization Methods: Such as PU and OMWU achieve O(log n) iterations but require maintaining probability distributions over all pure strategies, unsuitable for large-scale games
By introducing perturbations in the best-response oracle to make it more powerful, the research aims to break through the theoretical lower bound limitations and achieve logarithmic convergence speed.
Introduction of Perturbed Best Response Oracle (PBRO): Proposes a mechanism that randomly perturbs utilities before computing best responses
Theoretical Results:
Proves that Stochastic Fictitious Play (SFP) achieves O(log n/ε²) expected iteration complexity
Proves that Stochastic Double Oracle (SDO) achieves O(log n) expected iterations for specific difficult instances (Zhang and Sandholm's example)
Efficient Implementation Scheme: Proposes efficient perturbation methods for games with internal structure (such as POSG, path planning games) without requiring traversal of all pure strategies
Experimental Validation: Verifies the effectiveness of perturbation methods on multiple game types, including matrix games, stochastic games, and path planning games
Theoretical Tools: Utilizes the Gumbel-max trick to establish connections between SFP and stochastic exponential weighted forecasting (REWF)
Theorem 3 (SFP Complexity):
For matrix game M ∈ ℝ^(m×n) normalized to 0,1, m ≤ n, setting β = (2+√(2ln n))/(ε√(8ln n)), SFP finds ε-NE within O(log n/ε²) expected iterations.
Proof Sketch:
Set iteration count T = ((2+√(2ln n))/ε)² ∈ O(log n/ε²)
Using Corollary 2 (REWF regret bound), with probability ≥3/4:
Lemma 4 (Expected Value of Uniform Perturbation):
For the k-th row x = ⟨1,...,1,0,-1,...,-1⟩, using U(-1/2,1/2) perturbation,
the expected index of perturbed best response EI = k/2
Theorem 6: SDO finds NE of L within O(log n) expected iterations
Proof Technique:
Define potential function X_t = max{r_t, c_t}, where r_t = min R_t, c_t = min C_t
Related Work:
5. Zhang & Sandholm (2024) - Exponential lower bound for DO
6. Cloud et al. (2023) - Anticipatory Fictitious Play
7. Lanctot et al. (2017) - Policy Space Response Oracles
8. Cen et al. (2024) - Regularized game algorithms
Overall Assessment: This is an excellent paper combining theory and practice well. The main contribution is proving that perturbation mechanisms enable BRO algorithms to achieve logarithmic complexity, breaking known theoretical lower bounds (by enhancing the oracle). The theoretical results for SFP are complete and rigorous, while SDO analysis, though incomplete, provides valuable insights. The experimental design is comprehensive and honestly reports the method's applicable scope. Main limitations are unresolved general SDO complexity and lack of theoretical explanation for poor performance in random games. This work opens new directions for game theory algorithm research and has potential application value for deep reinforcement learning, warranting further investigation.