Reinforcing Competitive Multi-Agents for Playing 'So Long Sucker'

Sharan, Adak

This paper investigates the strategy game So Long Sucker (SLS) as a novel benchmark for multi-agent reinforcement learning (MARL). Unlike traditional board or video game testbeds, SLS is distinguished by its coalition formation, strategic deception, and dynamic elimination rules, making it a uniquely challenging environment for autonomous agents. We introduce the first publicly available computational framework for SLS, complete with a graphical user interface and benchmarking support for reinforcement learning algorithms. Using classical deep reinforcement learning methods (e.g., DQN, DDQN, and Dueling DQN), we train self-playing agents to learn the rules and basic strategies of SLS. Experimental results demonstrate that, although these agents achieve roughly half of the maximum attainable reward and consistently outperform random baselines, they require long training horizons (~2000 games) and still commit occasional illegal moves, highlighting both the promise and limitations of classical reinforcement learning. Our findings establish SLS as a negotiation-aware benchmark for MARL, opening avenues for future research that integrates game-theoretic reasoning, coalition-aware strategies, and advanced reinforcement learning architectures to better capture the social and adversarial dynamics of complex multi-agent games.

Basic Information

Paper ID: 2411.11057
Title: Reinforcing Competitive Multi-Agents for Playing 'So Long Sucker'
Authors: Medant Sharan (King's College London), Chandranath Adak (IIT Patna)
Classification: cs.AI
Publication Date: November 2024 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2411.11057

Abstract

This paper introduces the strategy game "So Long Sucker" (SLS) to multi-agent reinforcement learning (MARL) as a novel benchmark. Unlike traditional board-game or video-game testbeds, SLS features coalition formation, strategic deception, and dynamic elimination rules, which make it a uniquely challenging environment for autonomous agents. The authors developed the first publicly available computational framework for SLS, complete with a graphical user interface and benchmarking support for reinforcement learning algorithms. Classical deep reinforcement learning methods (DQN, DDQN, Dueling DQN) were used to train self-play agents to learn the rules and basic strategies of SLS. Experimental results show that, although these agents achieve approximately half of the maximum attainable reward and consistently outperform random baselines, they require long training horizons (approximately 2000 games) and still occasionally take illegal actions, highlighting both the promise and the limitations of classical reinforcement learning.

Research Background and Motivation

Problem Definition

Existing multi-agent reinforcement learning benchmarks primarily focus on purely cooperative objectives (such as coordination tasks) or adversarial competition (such as two-player zero-sum games), lacking mixed environments capable of simultaneously capturing coalition formation and betrayal dynamics. Although breakthroughs have been achieved in domains such as Go, StarCraft II, and Diplomacy, these benchmarks do not adequately reflect the unique coalition and betrayal dynamics inherent to SLS.

Research Significance

SLS, designed by Hausner, Nash, Shapley, and Shubik as a four-player strategic game, revolves around coalition formation, temporary alliances, and inevitable betrayal. Victory depends not only on legal moves but also on diplomacy and opportunism, making it a unique testing platform for studying trust, negotiation, and social dilemmas.

Limitations of Existing Approaches

Most MARL benchmarks lack the mixed dynamics of coalition and betrayal
Prior work on socially rich settings typically relies on explicit communication channels or hand-crafted interaction rules
SLS has not previously been studied as a computational benchmark

Research Motivation

By formalizing SLS as a reproducible sequential variant and benchmarking baseline DRL algorithms, this paper positions SLS as a coalition- and betrayal-aware testbed for advancing MARL research.

Core Contributions

First SLS Computational Framework: Designed and released the first computational framework for SLS specifically tailored for reinforcement learning research, equipped with a GUI for experimentation
Classical DRL Algorithm Benchmarking: Benchmarked classical DRL algorithms (DQN, DDQN, Dueling DQN) in SLS, analyzing their capacity to achieve legal game proficiency and partial strategic awareness
Coalition- and Betrayal-Aware Benchmark: Established SLS as a coalition- and betrayal-aware benchmark for MARL, inspiring future research combining DRL with game-theoretic reasoning

Methodology Details

Task Definition

SLS is converted into a MARL environment using the generalized Hofstra version of the zero-sum variant. Four players, each assigned a unique color, start with 5 chips of their color and play on a board with at most 6 active stacks. The winning condition is to be the last surviving player.

Reinforcement Learning Formalization

SLS is modeled as a Markov Decision Process (MDP):

State Space S: The set of all possible game states
Action Space A: The set of all available actions for agents (discrete valid moves)
Transition Function: p(s'|s,a) represents the probability of transitioning to s' after executing action a in state s
Reward Function: r(s,a,s') assigns a scalar value to each transition
Policy: π(a|s) is the agent's policy for selecting action a given state s

The objective is to find the optimal policy π* that maximizes the expected discounted return, where γ is the discount factor: $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$
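As a small worked illustration of the return defined above, the following Python sketch accumulates a finite reward sequence backwards. The function name and the example rewards are illustrative assumptions rather than part of the paper's framework; the default γ = 0.95 is the discount factor reported in the hyperparameter settings.

```python
def discounted_return(rewards, gamma=0.95):
    """Compute R_t = sum_k gamma^k * r_{t+k+1} for a finite episode.

    `rewards` is the list [r_{t+1}, r_{t+2}, ...] observed after time t.
    """
    total = 0.0
    # Walk backwards so each step applies one more factor of gamma
    # to everything that comes later in the episode.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Toy example (made-up rewards, not data from the paper).
example = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
print(example)
```

Because the sum is truncated at the end of the episode, this matches the infinite-horizon formula only under the usual convention that rewards after termination are zero.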

State Representation

State s_t encodes all information necessary to describe the game environment: $s_t = (\text{Board Configuration}, \text{Player Chips}, \text{Eliminated Chips}, \text{Current Player}, \text{Game Phase}, \text{Step Count})$

The observation space size is: $\text{obs\_size} = (n_{\text{rows}} \times n_{\text{players}} \times n_{\text{max\_pile}}) + n_{\text{players}}^2 + (2 \times n_{\text{players}}) + 4 + 1$
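The formula above can be evaluated directly. The sketch below is only an arithmetic illustration: it assumes that n_rows corresponds to the six active stacks and picks a hypothetical n_max_pile of 20 (the total number of chips in a four-player, five-chip game), since this summary does not state the values used in the paper.

```python
def obs_size(n_rows, n_players, n_max_pile):
    """Evaluate the observation-size formula term by term."""
    board_term = n_rows * n_players * n_max_pile
    # The remaining terms are copied from the formula as given.
    return board_term + n_players ** 2 + 2 * n_players + 4 + 1


# Assumed values: 6 rows (active stacks), 4 players, piles of up to 20 chips.
size = obs_size(n_rows=6, n_players=4, n_max_pile=20)
print(size)
```

With these assumed values the board term is 480 and the remaining terms add 29, so the observation vector would have 509 entries.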

Action Space

The discrete action space A = {A₀, A₁, ..., A₉} contains ten actions:

A₀-A₅: Stack selection actions (valid during stack selection phase)
A₆-A₉: Player/color decision actions (valid during chip selection, next player selection, and chip elimination phases)
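The phase-dependent validity described above can be expressed as a mask over the ten actions. The following Python sketch is a minimal illustration at the phase level only; the phase names, function names, and Q-values are hypothetical, and a real environment would also need to rule out actions that are illegal for other reasons (for example, selecting a stack that does not exist).

```python
N_ACTIONS = 10
STACK_PHASE = "stack_selection"
PLAYER_PHASES = {"chip_selection", "next_player_selection", "chip_elimination"}


def phase_mask(phase):
    """Return a list of booleans marking which of A0..A9 the phase allows."""
    if phase == STACK_PHASE:
        return [i <= 5 for i in range(N_ACTIONS)]   # A0-A5
    if phase in PLAYER_PHASES:
        return [i >= 6 for i in range(N_ACTIONS)]   # A6-A9
    raise ValueError(f"unknown phase: {phase}")


def masked_greedy(q_values, mask):
    """Pick the highest-valued action among those the mask allows."""
    valid = [i for i, allowed in enumerate(mask) if allowed]
    return max(valid, key=lambda i: q_values[i])


# Toy Q-values: action 0 has the largest value but is masked out here.
q = [9.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 5.0, 2.0, 3.0]
print(masked_greedy(q, phase_mask("chip_selection")))
```

Restricting the argmax to allowed indices, rather than penalizing disallowed ones, guarantees that the greedy choice respects the phase; exploration steps would need the same restriction.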

Reward Design

The reward signal at time step t is defined as: $r_t = \min\left(\wp, \frac{\wp}{(\alpha/n_c) \cdot t}\right)$

where α ∈ (0,1] is a hyperparameter controlling the decay rate, and ℘ is the reward magnitude. Illegal actions receive a fixed negative reward (-℘), while legal actions receive positive rewards up to +℘, with this value decaying over steps to promote efficiency.
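To make the decay behaviour concrete, here is a minimal Python sketch of the reward rule as stated. This summary does not define n_c or give values for α and ℘, so the defaults below are placeholder assumptions, and the function name is hypothetical. The sketch also assumes that steps are counted from t = 1 so that the denominator is never zero.

```python
def step_reward(t, legal, magnitude=1.0, alpha=0.5, n_c=5):
    """Reward at step t: -magnitude if illegal, else a capped, decaying bonus.

    magnitude, alpha and n_c are placeholder values, not the paper's.
    """
    if not legal:
        return -magnitude
    if t < 1:
        raise ValueError("t is assumed to start at 1")
    # min(P, P / ((alpha / n_c) * t)): flat at P early on, then ~1/t decay.
    return min(magnitude, magnitude / ((alpha / n_c) * t))


for t in (1, 10, 20, 40):
    print(t, step_reward(t, legal=True))
```

Under the stated formula the reward stays at its cap until t exceeds n_c/α and then falls off in proportion to 1/t, which is what rewards shorter games.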

Experimental Setup

Game Configuration

Number of Players: 4 players
Initial Chips: 5 chips of the same color per player
Maximum Stack Count: 6 active stacks
Winning Condition: Be the last surviving player; zero-sum game with reward structure {0,0,0,ù}, ù ∈ ℕ⁺

Training Configuration

A centralized cumulative learning setup is employed where all four player agents share a common learning network and replay buffer. The network architecture consists of two fully connected hidden layers with 64 neurons (ReLU activation), followed by a linear output layer.
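The shared-network arrangement can be sketched in plain Python. This is an illustrative forward pass only, not the authors' implementation: the weight initialization, the toy observation size of 12, the buffer capacity, and all names are assumptions. What it takes from the description above is the shape of the network (two hidden layers of 64 ReLU units and a linear output with one value per action) and the fact that all four players use the same network and the same replay buffer.

```python
import random
from collections import deque


def init_layer(n_in, n_out, rng):
    """Return (weights, biases) for one fully connected layer."""
    scale = (1.0 / n_in) ** 0.5  # assumed init scheme, not from the paper
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(x, layer):
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


class QNetwork:
    def __init__(self, obs_size, n_actions=10, hidden=64, seed=0):
        rng = random.Random(seed)
        self.layers = [
            init_layer(obs_size, hidden, rng),
            init_layer(hidden, hidden, rng),
            init_layer(hidden, n_actions, rng),
        ]

    def forward(self, obs):
        h = relu(linear(obs, self.layers[0]))
        h = relu(linear(h, self.layers[1]))
        return linear(h, self.layers[2])  # linear output: one Q-value per action


# One network and one buffer shared by all four players.
shared_net = QNetwork(obs_size=12)      # obs_size=12 is a toy value
shared_buffer = deque(maxlen=10_000)    # capacity is an assumption

for player in range(4):
    obs = [0.0] * 12
    q_values = shared_net.forward(obs)
    action = q_values.index(max(q_values))
    # (player, state, action, reward, next_state, done) with dummy values
    shared_buffer.append((player, obs, action, 0.0, obs, False))

batch = random.sample(list(shared_buffer), k=2)
```

Pooling experience this way means every update sees transitions from all four seats, which is one plausible reading of "centralized cumulative learning"; a practical version would use a deep learning library for gradient updates.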

Hyperparameter Settings

Discount factor γ = 0.95
Initial exploration rate ε₀ = 1.0
Exploration decay rate ε_decay = 0.995
Minimum exploration rate ε_min = 0.01
Learning rate = 0.001
Batch size = 64
Training episodes = 10,000 games
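The exploration settings above imply a simple geometric schedule. The sketch below assumes that ε is multiplied by the decay rate once per game (the summary does not say whether decay is applied per game or per step) and uses hypothetical function names.

```python
def epsilon_at(episode, eps0=1.0, decay=0.995, eps_min=0.01):
    """Exploration rate after `episode` decay steps, clipped at eps_min."""
    return max(eps_min, eps0 * decay ** episode)


def episodes_until_floor(eps0=1.0, decay=0.995, eps_min=0.01):
    """Count decay steps until the unclipped rate first drops to eps_min."""
    n = 0
    while eps0 * decay ** n > eps_min:
        n += 1
    return n


for episode in (0, 100, 500, 2000):
    print(episode, round(epsilon_at(episode), 4))
print(episodes_until_floor())
```

Under the per-game assumption, ε reaches its floor after roughly 920 games, so most of the 10,000 training games would be played almost greedily.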

Evaluation Metrics

Mean and standard deviation of cumulative reward
Average steps per game
Reward range (minimum, maximum)
Step range (minimum, maximum)
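These metrics can be computed from per-game logs with the Python standard library. The sketch below is an assumption about how such a summary might be produced: the function names and the sample numbers are made up, and it uses the sample standard deviation because the summary does not say which estimator the paper reports.

```python
import statistics


def summarize(values):
    """Mean, sample standard deviation, and range of a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),  # sample std; an assumption
        "min": min(values),
        "max": max(values),
    }


def evaluate(games):
    """`games` is a list of (cumulative_reward, steps) pairs, one per game."""
    rewards = [reward for reward, _ in games]
    steps = [step_count for _, step_count in games]
    return {"reward": summarize(rewards), "steps": summarize(steps)}


# Toy log of three games (made-up numbers, not results from the paper).
report = evaluate([(100.0, 60), (120.0, 55), (80.0, 70)])
print(report)
```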

Baseline Methods

DQN (Deep Q-Network)
DDQN (Double DQN)
Dueling DQN
Random baseline

Experimental Results

Main Results

Agent	Reward (Mean ± Std)	Reward Range (Min, Max)	Steps (Mean ± Std)	Step Range (Min, Max)
DQN	103.40 ± 42.31	-313.45, 189.24	61.16 ± 14.51	27, 162
DDQN	108.44 ± 44.95	-279.13, 191.38	61.23 ± 14.18	28, 165
Dueling DQN	102.06 ± 49.62	-319.76, 192.09	65.92 ± 15.94	28, 173
Random	-8.78 ± 43.52	-419.26, 94.19	65.24 ± 17.76	29, 174

Key Findings

Performance: All DRL agents consistently outperform the random baseline, reaching mean rewards of roughly 102 to 108, which is approximately half of the theoretical maximum reward (≈200)
Convergence Characteristics: DDQN achieves the most stable convergence and highest average reward, validating the benefits of double estimation in mitigating Q-value overestimation in long-horizon games
Learning Dynamics: Early training phases (<500 games) exhibit high reward variance with frequent illegal actions; after approximately 2000 games, all DRL agents show smoother convergence
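The remark about double estimation can be illustrated by comparing the two bootstrap targets. The sketch below follows the standard DQN and Double DQN target definitions; it is not taken from the paper's code, and the function names and toy Q-values are assumptions.

```python
def dqn_target(reward, next_q_target, done, gamma=0.95):
    """Standard DQN: the target network both selects and evaluates."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)


def ddqn_target(reward, next_q_online, next_q_target, done, gamma=0.95):
    """Double DQN: the online network selects, the target network evaluates."""
    if done:
        return reward
    best = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best]


# Toy next-state estimates where the two networks disagree.
online = [1.0, 3.0, 2.0]
target = [5.0, 0.5, 4.0]
print(dqn_target(1.0, target, done=False))
print(ddqn_target(1.0, online, target, done=False))
```

In this toy case the DQN target takes the target network's largest value (5.0), while the Double DQN target evaluates the action preferred by the online network (0.5). Decoupling selection from evaluation is what reduces the upward bias of the max operator.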

Learning Curve Analysis

The training process consists of three stages:

Exploration Phase (0-500 games): High variance, frequent illegal actions
Learning Phase (500-2000 games): Gradual rule mastery, steady reward increase
Convergence Phase (>2000 games): Stable rewards in the 100-120 range with occasional exploratory dips

Related Work

MARL Benchmark Development

Traditional Benchmarks: Go and StarCraft II primarily focus on pure competition or cooperation
Social Games: Diplomacy involves negotiation but relies on explicit communication
Game-Theoretic Applications: Nash equilibrium solving in multi-agent systems

Deep Reinforcement Learning Applications in Games

AlphaGo Series: Breakthroughs in perfect information games
Multi-Agent Learning: Self-play training and strategy diversity
Value Function Methods: DQN and its variants in discrete action spaces

This paper is the first to establish SLS as a computational benchmark, filling a gap in research on coalition formation and betrayal dynamics.

Conclusions and Discussion

Main Conclusions

Classical value-based methods can learn core SLS rules and partial strategies, achieving stable but suboptimal performance
High reward variance reflects sensitivity to initialization and exploration
Context-dependent actions expose limitations of short-term value estimation
SLS is successfully established as a negotiation-aware MARL benchmark

Limitations

Strategic Limitations: Agents tend to adopt reactive rather than strategic behavior
Rule Compliance: Despite dynamic action masking, agents occasionally execute illegal actions
Long-Term Reasoning: Difficulties with combinatorial action spaces and delayed reward dependencies
Coalition Dynamics: Failure to adequately capture complex coalition formation and betrayal strategies

Future Directions

Architectural Improvements: Integrate actor-critic and coalition-aware frameworks
Strategy Enhancement: Strengthen long-term reasoning and rule compliance
Social Dynamics: Develop negotiation, coalition, and deception capabilities
Theoretical Analysis: Combine game-theoretic reasoning with deep learning

In-Depth Evaluation

Strengths

Innovative Benchmark: First introduction of SLS to MARL, filling an important gap in research on coalition and betrayal dynamics
Complete Framework: Provides a comprehensive computational framework with GUI, promoting reproducible research
Systematic Evaluation: Comprehensive benchmarking of multiple classical DRL methods
Theoretical Contribution: Clarifies zero-sum variant rules, addressing incompleteness in original formalization

Weaknesses

Methodological Limitations: Only classical value-based methods tested; advanced MARL algorithms unexplored
Simplified Setting: Removal of explicit negotiation mechanisms may lose core SLS characteristics
Performance Bottlenecks: Agents still execute illegal actions, exposing fundamental method inadequacies
Insufficient Theoretical Analysis: Lacks deep analysis of SLS game-theoretic properties

Impact

Academic Value: Provides new research directions and benchmarks for the MARL community
Practical Significance: Open-source framework release will facilitate subsequent research
Methodological Contribution: Demonstrates how to convert complex strategic games into ML-friendly environments
Limitation-Driven Insights: Reveals classical RL inadequacies in complex social games, guiding future research

Applicable Scenarios

MARL Research: Algorithm development for coalition formation and betrayal dynamics
Game-Theoretic Applications: Computational models for multi-party negotiation and strategic reasoning
Social AI: Modeling of trust, deception, and cooperative behavior
Educational Tools: Teaching demonstrations for game theory and multi-agent systems

References

Hausner, M., Nash, J., Shapley, L., & Shubik, M. (1964). So Long Sucker - A Four-Person Game.
Vinyals, O. et al. (2019). Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature.
FAIR Team et al. (2022). Human-level play in the game of Diplomacy by combining language models with strategic reasoning. Science.
Mnih, V. et al. (2015). Human-level control through deep reinforcement learning. Nature.

By introducing SLS as a novel MARL benchmark, this paper provides a valuable platform for research on coalition formation and strategic deception. Although the current results expose the limitations of classical methods, those limitations underscore how challenging the benchmark is and point toward the development of more advanced multi-agent learning algorithms.