Reinforcing Competitive Multi-Agents for Playing 'So Long Sucker'

Sharan, Adak
This paper investigates the strategy game So Long Sucker (SLS) as a novel benchmark for multi-agent reinforcement learning (MARL). Unlike traditional board or video game testbeds, SLS is distinguished by its coalition formation, strategic deception, and dynamic elimination rules, making it a uniquely challenging environment for autonomous agents. We introduce the first publicly available computational framework for SLS, complete with a graphical user interface and benchmarking support for reinforcement learning algorithms. Using classical deep reinforcement learning methods (e.g., DQN, DDQN, and Dueling DQN), we train self-playing agents to learn the rules and basic strategies of SLS. Experimental results demonstrate that, although these agents achieve roughly half of the maximum attainable reward and consistently outperform random baselines, they require long training horizons (~2000 games) and still commit occasional illegal moves, highlighting both the promise and limitations of classical reinforcement learning. Our findings establish SLS as a negotiation-aware benchmark for MARL, opening avenues for future research that integrates game-theoretic reasoning, coalition-aware strategies, and advanced reinforcement learning architectures to better capture the social and adversarial dynamics of complex multi-agent games.

Basic Information

  • Paper ID: 2411.11057
  • Title: Reinforcing Competitive Multi-Agents for Playing 'So Long Sucker'
  • Authors: Medant Sharan (King's College London), Chandranath Adak (IIT Patna)
  • Classification: cs.AI
  • Publication Date: November 2024 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2411.11057

Abstract

This paper introduces the strategy game "So Long Sucker" (SLS) to multi-agent reinforcement learning (MARL) as a novel benchmark. Unlike traditional board-game or video-game testbeds, SLS features coalition formation, strategic deception, and dynamic elimination rules, making it a uniquely challenging environment for autonomous agents. The authors developed the first publicly available computational framework for SLS, complete with a graphical user interface and benchmarking support for reinforcement learning algorithms. Classical deep reinforcement learning methods (DQN, DDQN, Dueling DQN) were used to train self-play agents to learn the rules and basic strategies of SLS. Experimental results show that although these agents achieve approximately half of the maximum attainable reward and consistently outperform random baselines, they require long training horizons (approximately 2000 games) and still occasionally take illegal actions, highlighting both the promise and the limitations of classical reinforcement learning.

Research Background and Motivation

Problem Definition

Existing multi-agent reinforcement learning benchmarks primarily focus on purely cooperative objectives (such as coordination tasks) or adversarial competition (such as two-player zero-sum games), lacking mixed environments capable of simultaneously capturing coalition formation and betrayal dynamics. Although breakthroughs have been achieved in domains such as Go, StarCraft II, and Diplomacy, these benchmarks do not adequately reflect the unique coalition and betrayal dynamics inherent to SLS.

Research Significance

SLS, designed by Hausner, Nash, Shapley, and Shubik as a four-player strategic game, revolves around coalition formation, temporary alliances, and inevitable betrayal. Victory depends not only on legal moves but also on diplomacy and opportunism, making it a unique testing platform for studying trust, negotiation, and social dilemmas.

Limitations of Existing Approaches

  1. Most MARL benchmarks lack the mixed dynamics of coalition and betrayal
  2. Prior work on socially rich settings typically relies on explicit communication channels or hand-crafted interaction rules
  3. SLS has not previously been studied as a computational benchmark

Research Motivation

By formalizing SLS as a reproducible sequential variant and benchmarking baseline DRL algorithms, this paper positions SLS as a coalition and betrayal-aware testing platform for advancing MARL research.

Core Contributions

  1. First SLS Computational Framework: Designed and released the first computational framework for SLS specifically tailored for reinforcement learning research, equipped with a GUI for experimentation
  2. Classical DRL Algorithm Benchmarking: Benchmarked classical DRL algorithms (DQN, DDQN, Dueling DQN) in SLS, analyzing their capacity to achieve legal game proficiency and partial strategic awareness
  3. Coalition and Betrayal-Aware Benchmark: Established SLS as a coalition and betrayal-aware benchmark for MARL, inspiring future research combining DRL with game-theoretic reasoning

Methodology Details

Task Definition

SLS is converted into a MARL environment using the generalized Hofstra version of the zero-sum variant. Four players, each assigned a unique color, start with 5 chips of their color and play on a board with at most 6 active stacks. The winning condition is to be the last surviving player.

Reinforcement Learning Formalization

SLS is modeled as a Markov Decision Process (MDP):

  • State Space S: The set of all possible game states
  • Action Space A: The set of all available actions for agents (discrete valid moves)
  • Transition Function: p(s'|s,a) represents the probability of transitioning to s' after executing action a in state s
  • Reward Function: r(s,a,s') assigns a scalar value to each transition
  • Policy: π(a|s) is the agent's policy for selecting action a given state s

The objective is to find the optimal policy π* that maximizes the expected discounted return: $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$
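
As a small worked illustration of the discounted return, the sketch below evaluates the sum for a finite reward sequence. The function name `discounted_return` and the toy reward list are illustrative assumptions; only the discount factor γ = 0.95 comes from the hyperparameter settings reported in this summary.

```python
def discounted_return(rewards, gamma=0.95):
    """Return sum_k gamma**k * rewards[k] for a finite reward sequence."""
    total = 0.0
    # Work backwards so each step applies exactly one multiplication by gamma.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Toy rewards, not taken from the paper: 1 + 0.95 + 0.95**2.
example_return = discounted_return([1.0, 1.0, 1.0])
print(example_return)
```

For three unit rewards the result is about 2.8525, which shows how later rewards contribute less than earlier ones.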

State Representation

State s_t encodes all information necessary to describe the game environment: $s_t = (\text{Board Configuration}, \text{Player Chips}, \text{Eliminated Chips}, \text{Current Player}, \text{Game Phase}, \text{Step Count})$

The observation space size is: $\text{obs\_size} = (n_{\text{rows}} \times n_{\text{players}} \times n_{\text{max\_pile}}) + n_{\text{players}}^2 + (2 \times n_{\text{players}}) + 4 + 1$
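
The observation-size formula can be checked with a few lines of arithmetic. This is a minimal sketch: the function name `observation_size` is hypothetical, mapping n_rows to the 6 active stacks is an assumption, and n_max_pile = 10 is a placeholder because this summary does not state the maximum pile height.

```python
def observation_size(n_rows, n_players, n_max_pile):
    """Evaluate the observation-size formula term by term."""
    board_term = n_rows * n_players * n_max_pile
    player_terms = n_players ** 2 + 2 * n_players
    # The trailing "+ 4 + 1" constants are kept exactly as in the formula.
    return board_term + player_terms + 4 + 1


# n_rows=6 and n_max_pile=10 are assumed values used only for illustration.
size = observation_size(n_rows=6, n_players=4, n_max_pile=10)
print(size)
```

With these assumed values the board term is 240 and the remaining terms add 29, for a total of 269 inputs.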

Action Space

Discrete action space A = {A₀, A₁, ..., A₉}, including:

  • A₀-A₅: Stack selection actions (valid during stack selection phase)
  • A₆-A₉: Player/color decision actions (valid during chip selection, next player selection, and chip elimination phases)
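
The phase-dependent validity of the ten actions can be sketched as a boolean mask. This is an illustration under stated assumptions rather than the framework's code: the `Phase` enum, `phase_mask`, and `masked_argmax` are hypothetical names, and a real mask would presumably also depend on the board state (for example, which stacks exist), which is not modeled here.

```python
from enum import Enum


class Phase(Enum):
    STACK_SELECTION = "stack_selection"
    CHIP_SELECTION = "chip_selection"
    NEXT_PLAYER_SELECTION = "next_player_selection"
    CHIP_ELIMINATION = "chip_elimination"


N_STACK_ACTIONS = 6  # A0..A5
N_COLOR_ACTIONS = 4  # A6..A9


def phase_mask(phase):
    """Return a list of 10 booleans marking the actions valid in this phase."""
    stack_phase = phase is Phase.STACK_SELECTION
    return [stack_phase] * N_STACK_ACTIONS + [not stack_phase] * N_COLOR_ACTIONS


def masked_argmax(q_values, mask):
    """Pick the highest-valued action among those the mask allows."""
    legal = [index for index, allowed in enumerate(mask) if allowed]
    return max(legal, key=lambda index: q_values[index])


q = [0.9, 0.1, 0.2, 0.3, 0.4, 0.5, 0.0, 0.7, 0.6, 0.2]
print(masked_argmax(q, phase_mask(Phase.CHIP_SELECTION)))
```

In the chip selection phase the stack action A₀ is ignored even though it has the largest value, and the best color action is chosen instead.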

Reward Design

The reward signal at time step t is defined as: $r_t = \min\left(\wp, \frac{\wp}{(\alpha/n_c) \cdot t}\right)$

where α ∈ (0,1] is a hyperparameter controlling the decay rate, and ℘ is the reward magnitude. Illegal actions receive a fixed negative reward (-℘), while legal actions receive positive rewards up to +℘, with this value decaying over steps to promote efficiency.
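
The reward rule can be written as a short function. This sketch makes several assumptions: the name `step_reward` is hypothetical, n_c is passed in as a plain parameter because this summary does not define it, t is taken to start at 1, and the numeric values in the example are placeholders.

```python
def step_reward(t, legal, magnitude, alpha, n_c):
    """Reward at step t: -magnitude if illegal, else a capped, decaying bonus."""
    if t < 1:
        raise ValueError("t must be a positive step index")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    if not legal:
        return -magnitude
    # Capped at +magnitude early on, then decays proportionally to 1/t.
    return min(magnitude, magnitude / ((alpha / n_c) * t))


# Placeholder values: magnitude=1.0, alpha=0.5, n_c=5.
for step in (1, 10, 20, 40):
    print(step, step_reward(step, True, magnitude=1.0, alpha=0.5, n_c=5))
```

Under these placeholder values the legal-move reward stays at the cap until t = n_c/α = 10 and then falls off as 1/t, so longer games earn less per step.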

Experimental Setup

Game Configuration

  • Number of Players: 4 players
  • Initial Chips: 5 chips of the same color per player
  • Maximum Stack Count: 6 active stacks
  • Winning Condition: Zero-sum game with reward structure {0,0,0,ù}, ù ∈ ℕ⁺

Training Configuration

A centralized cumulative learning setup is employed where all four player agents share a common learning network and replay buffer. The network architecture consists of two fully connected hidden layers with 64 neurons (ReLU activation), followed by a linear output layer.
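
The shared-network setup described above can be sketched without any deep learning library. This is a dependency-free illustration of the forward pass only, not the authors' implementation: the helper names, the initialization scheme, the observation size of 32, and the buffer capacity are all assumptions, and training is omitted.

```python
import random
from collections import deque

HIDDEN = 64     # two hidden layers of 64 units, as stated in the text
N_ACTIONS = 10  # A0..A9


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases (initialization scheme is assumed)."""
    scale = (2.0 / n_in) ** 0.5
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(x, layer):
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


def build_network(obs_size, rng):
    return [init_layer(obs_size, HIDDEN, rng),
            init_layer(HIDDEN, HIDDEN, rng),
            init_layer(HIDDEN, N_ACTIONS, rng)]


def q_values(network, obs):
    """Forward pass: two ReLU hidden layers, then a linear output layer."""
    hidden = relu(linear(obs, network[0]))
    hidden = relu(linear(hidden, network[1]))
    return linear(hidden, network[2])


rng = random.Random(0)
OBS_SIZE = 32                              # placeholder observation size
shared_network = build_network(OBS_SIZE, rng)
shared_buffer = deque(maxlen=10_000)       # placeholder capacity

# All four players query the same network and write to the same buffer.
for player in range(4):
    obs = [rng.random() for _ in range(OBS_SIZE)]
    q = q_values(shared_network, obs)
    action = max(range(N_ACTIONS), key=q.__getitem__)
    shared_buffer.append((player, obs, action))
```

Sharing one network and one buffer means that experience gathered from any seat at the table updates the policy used by every seat.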

Hyperparameter Settings

  • Discount factor γ = 0.95
  • Initial exploration rate ε₀ = 1.0
  • Exploration decay rate ε_decay = 0.995
  • Minimum exploration rate ε_min = 0.01
  • Learning rate = 0.001
  • Batch size = 64
  • Training episodes = 10,000 games
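
To see what these exploration settings imply, the sketch below assumes ε is multiplied by the decay rate once per game and clipped at the minimum. That schedule is a common convention but is not confirmed by this summary, and the function names are hypothetical.

```python
def epsilon_at(episode, eps_start=1.0, eps_decay=0.995, eps_min=0.01):
    """Exploration rate after `episode` multiplicative decay steps."""
    return max(eps_min, eps_start * eps_decay ** episode)


def episodes_until_floor(eps_start=1.0, eps_decay=0.995, eps_min=0.01):
    """Count decay steps until the unclipped rate drops to eps_min or below."""
    episode = 0
    eps = eps_start
    while eps > eps_min:
        eps *= eps_decay
        episode += 1
    return episode


print(episodes_until_floor())
for episode in (0, 500, 2000):
    print(episode, round(epsilon_at(episode), 4))
```

Under this assumed schedule the exploration rate reaches its floor of 0.01 after 919 games, well before the 10,000-game training budget is exhausted.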

Evaluation Metrics

  • Mean and standard deviation of cumulative reward
  • Average steps per game
  • Reward range [minimum, maximum]
  • Step range [minimum, maximum]
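
These summary statistics can be computed from per-game logs along the following lines. The function name `summarize` and the toy values are illustrative only, and using the population standard deviation is an assumption, since the summary does not say which estimator was used.

```python
import statistics


def summarize(values):
    """Mean, standard deviation, and range of a list of per-game values."""
    return {
        "mean": statistics.fmean(values),
        "std": statistics.pstdev(values),  # population std (assumed choice)
        "min": min(values),
        "max": max(values),
    }


# Toy per-game values for illustration, not results from the paper.
toy_rewards = [90.0, 110.0, 100.0, 120.0]
toy_steps = [55, 60, 62, 71]
print(summarize(toy_rewards))
print(summarize(toy_steps))
```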

Baseline Methods

  • DQN (Deep Q-Network)
  • DDQN (Double DQN)
  • Dueling DQN
  • Random baseline
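
The three learned baselines differ mainly in how the bootstrap target or the Q-value head is formed. The sketch below shows the standard textbook forms of each; the function names and the toy numbers are illustrative, and the code is not taken from the SLS framework.

```python
def dqn_target(reward, next_q_target, done, gamma=0.95):
    """DQN: bootstrap from the maximum target-network value."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)


def ddqn_target(reward, next_q_online, next_q_target, done, gamma=0.95):
    """Double DQN: online network selects the action, target network scores it."""
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=next_q_online.__getitem__)
    return reward + gamma * next_q_target[best_action]


def dueling_q(state_value, advantages):
    """Dueling head: Q(s, a) = V(s) + A(s, a) - mean over a' of A(s, a')."""
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + adv - mean_advantage for adv in advantages]


# Toy next-state estimates where the two networks disagree.
online = [1.0, 3.0, 2.0]
target = [5.0, 0.5, 4.0]
print(dqn_target(1.0, target, done=False))
print(ddqn_target(1.0, online, target, done=False))
print(dueling_q(2.0, [1.0, 0.0, -1.0]))
```

In this toy case the DQN target follows the target network's largest estimate, while the Double DQN target uses the more modest value of the action preferred by the online network, which is the mechanism that reduces overestimation.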

Experimental Results

Main Results

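Agent       | Reward (Mean ± Std) | Reward Range [Min, Max] | Steps (Mean ± Std) | Step Range [Min, Max]
------------|---------------------|-------------------------|--------------------|----------------------
DQN         | 103.40 ± 42.31      | [-313.45, 189.24]       | 61.16 ± 14.51      | [27, 162]
DDQN        | 108.44 ± 44.95      | [-279.13, 191.38]       | 61.23 ± 14.18      | [28, 165]
Dueling DQN | 102.06 ± 49.62      | [-319.76, 192.09]       | 65.92 ± 15.94      | [28, 173]
Random      | -8.78 ± 43.52       | [-419.26, 94.19]        | 65.24 ± 17.76      | [29, 174]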

Key Findings

  1. Performance: All DRL agents consistently outperform the random baseline (mean reward -8.78), reaching mean rewards of roughly 102 to 108, about half of the theoretical maximum reward (≈200)
  2. Convergence Characteristics: DDQN achieves the most stable convergence and the highest average reward (108.44, versus 103.40 for DQN and 102.06 for Dueling DQN), consistent with the benefit of double estimation in mitigating Q-value overestimation in long-horizon games
  3. Learning Dynamics: Early training phases (<500 games) exhibit high reward variance with frequent illegal actions; after approximately 2000 games, all DRL agents show smoother convergence

Learning Curve Analysis

The training process consists of three stages:

  • Exploration Phase (0-500 games): High variance, frequent illegal actions
  • Learning Phase (500-2000 games): Gradual rule mastery, steady reward increase
  • Convergence Phase (>2000 games): Stable rewards in the 100-120 range with occasional exploratory dips

Related Work

MARL Benchmark Development

  • Traditional Benchmarks: Go and StarCraft II primarily focus on pure competition or cooperation
  • Social Games: Diplomacy involves negotiation but relies on explicit communication
  • Game-Theoretic Applications: Nash equilibrium solving in multi-agent systems

Deep Reinforcement Learning Applications in Games

  • AlphaGo Series: Breakthroughs in perfect information games
  • Multi-Agent Learning: Self-play training and strategy diversity
  • Value Function Methods: DQN and its variants in discrete action spaces

This paper is the first to establish SLS as a computational benchmark, filling a gap in research on coalition formation and betrayal dynamics.

Conclusions and Discussion

Main Conclusions

  1. Classical value-based methods can learn core SLS rules and partial strategies, achieving stable but suboptimal performance
  2. High reward variance reflects sensitivity to initialization and exploration
  3. Context-dependent actions expose limitations of short-term value estimation
  4. SLS is successfully established as a negotiation-aware MARL benchmark

Limitations

  1. Strategic Limitations: Agents tend to adopt reactive rather than strategic behavior
  2. Rule Compliance: Despite dynamic action masking, agents occasionally execute illegal actions
  3. Long-Term Reasoning: Difficulties with combinatorial action spaces and delayed reward dependencies
  4. Coalition Dynamics: Failure to adequately capture complex coalition formation and betrayal strategies

Future Directions

  1. Architectural Improvements: Integrate actor-critic and coalition-aware frameworks
  2. Strategy Enhancement: Strengthen long-term reasoning and rule compliance
  3. Social Dynamics: Develop negotiation, coalition, and deception capabilities
  4. Theoretical Analysis: Combine game-theoretic reasoning with deep learning

In-Depth Evaluation

Strengths

  1. Innovative Benchmark: First introduction of SLS to MARL, filling an important gap in research on coalition and betrayal dynamics
  2. Complete Framework: Provides a comprehensive computational framework with GUI, promoting reproducible research
  3. Systematic Evaluation: Comprehensive benchmarking of multiple classical DRL methods
  4. Theoretical Contribution: Clarifies zero-sum variant rules, addressing incompleteness in original formalization

Weaknesses

  1. Methodological Limitations: Only classical value-based methods tested; advanced MARL algorithms unexplored
  2. Simplified Setting: Removal of explicit negotiation mechanisms may lose core SLS characteristics
  3. Performance Bottlenecks: Agents still execute illegal actions, exposing fundamental method inadequacies
  4. Insufficient Theoretical Analysis: Lacks deep analysis of SLS game-theoretic properties

Impact

  1. Academic Value: Provides new research directions and benchmarks for the MARL community
  2. Practical Significance: Open-source framework release will facilitate subsequent research
  3. Methodological Contribution: Demonstrates how to convert complex strategic games into ML-friendly environments
  4. Limitation-Driven Insights: Reveals classical RL inadequacies in complex social games, guiding future research

Applicable Scenarios

  1. MARL Research: Algorithm development for coalition formation and betrayal dynamics
  2. Game-Theoretic Applications: Computational models for multi-party negotiation and strategic reasoning
  3. Social AI: Modeling of trust, deception, and cooperative behavior
  4. Educational Tools: Teaching demonstrations for game theory and multi-agent systems

References

  1. Hausner, M., Nash, J., Shapley, L., & Shubik, M. (1964). So Long Sucker- A Four-Person Game
  2. Vinyals, O. et al. (2019). Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature
  3. FAIR Team et al. (2022). Human-level play in the game of diplomacy by combining language models with strategic reasoning. Science
  4. Mnih, V. et al. (2015). Human-level control through deep reinforcement learning. Nature

By introducing SLS as a novel MARL benchmark, this paper provides a valuable platform for research on coalition formation and strategic deception. The current results expose the limitations of classical methods, which in turn underlines how challenging the benchmark is and points toward more advanced multi-agent learning algorithms.