Can Large Language Models Master Complex Card Games?

Wang, Bie, Chen et al.
Complex games have long been an important benchmark for testing the progress of artificial intelligence algorithms. AlphaGo, AlphaZero, and MuZero have defeated top human players in Go and Chess, garnering widespread societal attention towards artificial intelligence. Concurrently, large language models (LLMs) have exhibited remarkable capabilities across various tasks, raising the question of whether LLMs can achieve similar success in complex games. In this paper, we explore the potential of LLMs in mastering complex card games. We systematically assess the learning capabilities of LLMs across eight diverse card games, evaluating the impact of fine-tuning on high-quality gameplay data, and examining the models' ability to retain general capabilities while mastering these games. Our findings indicate that: (1) LLMs can approach the performance of strong game AIs through supervised fine-tuning on high-quality data, (2) LLMs can achieve a certain level of proficiency in multiple complex card games simultaneously, with performance augmentation for games with similar rules and conflicts for dissimilar ones, and (3) LLMs experience a decline in general capabilities when mastering complex games, but this decline can be mitigated by integrating a certain amount of general instruction data. The evaluation results demonstrate strong learning ability and versatility of LLMs. The code is available at https://github.com/THUDM/LLM4CardGame

Basic Information

  • Paper ID: 2509.01328
  • Title: Can Large Language Models Master Complex Card Games?
  • Authors: Wei Wang, Fuqing Bie, Junzhe Chen, Dan Zhang, Shiyu Huang, Evgeny Kharlamov, Jie Tang
  • Classification: cs.CL
  • Conference: NeurIPS 2025 (39th Conference on Neural Information Processing Systems)
  • Paper Link: https://arxiv.org/abs/2509.01328
  • Code Link: https://github.com/THUDM/LLM4CardGame

Abstract

Complex games have long served as important benchmarks for testing the progress of artificial intelligence algorithms. AlphaGo, AlphaZero, and MuZero defeated top human players in Go and chess, attracting widespread societal attention to AI. Simultaneously, large language models (LLMs) have demonstrated exceptional capabilities across various tasks, raising the question of whether LLMs can achieve similar success in complex games. This paper explores the potential of LLMs to master complex card games. The research systematically evaluates the learning capabilities of LLMs across eight different card games, assesses the impact of fine-tuning on high-quality game data, and examines the model's ability to maintain general-purpose capabilities while mastering these games.

Research Background and Motivation

Problem Definition

The core research question is: Can large language models master complex card games in the same manner as specialized game AI?

Significance

  1. Exploring AI Capability Boundaries: Complex games serve as important scenarios for testing the limits of AI algorithms, as demonstrated by systems from Deep Blue to the AlphaGo series
  2. Assessing General Intelligence: Compared to specialized game AI, the game-mastering capabilities of LLMs as general-purpose learners hold greater research value
  3. Multi-task Learning Ability: Evaluating whether LLMs can simultaneously master multiple complex games without requiring specially designed network architectures

Limitations of Existing Approaches

  1. Insufficient Evaluation: Existing research primarily employs prompt-based methods without thoroughly assessing LLMs' learning capabilities
  2. Insufficient Task Complexity: Evaluated games lack sufficient complexity to comprehensively test LLMs' learning limits
  3. Single-Game Limitations: Lack of systematic research on LLMs' ability to simultaneously master multiple complex games

Research Motivation

Inspired by the success of the AlphaGo series, this work explores whether LLMs can master complex card games by learning from high-quality game trajectory data, and evaluates their advantages as general-purpose learners.

Core Contributions

  1. First comprehensive evaluation framework for assessing LLMs' learning capabilities across multiple high-complexity games
  2. Constructed large-scale, high-quality training datasets covering eight complex card games, avoiding the high computational cost of learning from scratch
  3. Systematic evaluation of LLMs' performance across three critical dimensions: single-game mastery, multi-game simultaneous learning, and general-purpose capability preservation
  4. Demonstrated that LLMs possess strong learning capabilities and generality, enabling them to simultaneously master multiple complex games without modifying model architecture

Methodology Details

Task Definition

  • Input: Game state information (hand cards, action history, legal actions, etc.)
  • Output: Game action decisions in JSON format
  • Constraints: Actions must be selected from the set of legal actions
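
The output constraint can be illustrated with a small validator. This is a minimal sketch, not the paper's code: the function name parse_action and the JSON key "action" are assumptions made for illustration, since the summary only states that the output is JSON and must be a legal action.

```python
import json
from typing import List, Optional


def parse_action(model_output: str, legal_actions: List[str]) -> Optional[str]:
    """Return the chosen action, or None if the output is unusable."""
    try:
        payload = json.loads(model_output)
    except json.JSONDecodeError:
        return None  # the model did not produce valid JSON
    if not isinstance(payload, dict):
        return None  # valid JSON, but not an object
    action = payload.get("action")  # "action" is an assumed key name
    if isinstance(action, str) and action in legal_actions:
        return action
    return None  # missing key or illegal action
```

A caller would typically fall back to some default (for example a random legal action) when None is returned; that fallback policy is likewise an assumption and is not described in the summary.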

Game Selection and Data Preparation

Game Selection Criteria

Eight card games were selected based on three dimensions:

  1. Popularity: How widely the game is played
  2. Complexity: Measured by the number of information sets and average information set size
  3. Data Availability: Availability of strong AI models or high-quality data

Selected Games

  • High-Complexity Games: Dou Dizhu, Guan Dan, Japanese Mahjong
  • Medium-Complexity Games: UNO, Gin Rummy
  • Poker-Type Games: Leduc Hold'em, Limit Texas Hold'em, No-Limit Texas Hold'em

Data Generation Pipeline

Trajectory Generation

  1. Teacher Models: Using strong game AI (e.g., DouZero, DanZero) or expert data
  2. Opponent Models: Rule-based models, random models, or other AI models
  3. Game Quantity: Adjusted according to game complexity, ranging from 6k to 400k games

Data Filtering

  1. Winner Filtering: Retaining only observation-action pairs from winning players
  2. Selective Filtering: Retaining only samples with more than one legal action
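
The two filters above can be sketched as a single pass over recorded steps. The Step record and its field names are hypothetical; only the two filtering rules come from the text.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Step:
    player: int               # seat index of the acting player (assumed field)
    observation: str          # serialized game state seen by that player
    legal_actions: List[str]  # actions available at this step
    action: str               # action the teacher actually took


def filter_steps(steps: List[Step], winners: Set[int]) -> List[Step]:
    """Keep steps played by winners that involved a real choice."""
    return [
        s for s in steps
        if s.player in winners          # winner filtering
        and len(s.legal_actions) > 1    # selective filtering
    ]
```

Forced moves (a single legal action) carry no decision signal, which is presumably why they are dropped.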

Instruction Data Generation

Designed game-specific prompt templates containing:

  • Game Introduction: Rules and objectives
  • State Data: Hand cards, community cards, action history, legal actions
  • Output Format: JSON format requirements
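
A prompt with these three parts might be assembled as follows. This is an illustrative sketch only: the wording, the field names, and the build_prompt helper are assumptions, and the actual templates in the paper are game-specific. Community cards are omitted for brevity.

```python
import json
from typing import List


def build_prompt(game_intro: str, hand: List[str], history: List[str],
                 legal_actions: List[str]) -> str:
    """Join game introduction, state data, and output format into one prompt."""
    state = {
        "hand_cards": hand,
        "action_history": history,
        "legal_actions": legal_actions,
    }
    parts = [
        game_intro,  # rules and objectives
        "Current state:\n" + json.dumps(state, ensure_ascii=False),  # state data
        'Reply with JSON of the form {"action": "<one legal action>"}.',  # output format
    ]
    return "\n\n".join(parts)
```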

Model Training Strategy

Model Selection

  • Multiple Model Types: Qwen2.5, Llama3.1, GLM4
  • Multiple Model Scales: 0.5B to 14B parameters

Training Configuration

  • Fine-tuning Method: LoRA fine-tuning (rank=8, alpha=16)
  • Learning Rate: Peak 1e-4 with cosine scheduling
  • Batch Size: 128
  • Training Epochs: 1 epoch
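
To make the schedule concrete, the sketch below computes the number of optimizer steps in one epoch and the cosine-decayed learning rate at a given step. It assumes no warmup, decay to zero, and that the batch size of 128 is the effective batch per optimizer step; none of these details are stated in the summary.

```python
import math

PEAK_LR = 1e-4     # peak learning rate from the configuration above
BATCH_SIZE = 128   # assumed to be the effective batch per optimizer step


def steps_per_epoch(num_samples: int, batch_size: int = BATCH_SIZE) -> int:
    """Optimizer steps needed to see every sample once."""
    return math.ceil(num_samples / batch_size)


def cosine_lr(step: int, total_steps: int, peak_lr: float = PEAK_LR,
              min_lr: float = 0.0) -> float:
    """Cosine decay from peak_lr at step 0 to min_lr at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


total = steps_per_epoch(1_000_000)  # e.g. a 1,000k-sample game dataset
print(total, cosine_lr(0, total), cosine_lr(total // 2, total))
```

With 1,000k samples and a batch size of 128, a single epoch is roughly 7.8k optimizer steps.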

Experimental Setup

Data Scale

Game              Players  Teacher Model  Game Count  Avg Steps  Training Data
Dou Dizhu         3        DouZero        200k        37.31      1,000k
Guan Dan          4        DanZero        6k          311.25     1,000k
Japanese Mahjong  4        Expert Data    7k          656.92     1,000k
UNO               2        Rule Model     50k         42.33      400k
Gin Rummy         2        Rule Model     50k         52.14      400k
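
As a rough reading aid for the table, the sketch below estimates what share of all recorded decision steps ends up in the training set. It assumes "Avg Steps" is the mean number of steps per game and that total steps are approximately game count times average steps; the summary does not say whether the training data was additionally subsampled, so the ratios are indicative only.

```python
def retained_share(games: int, avg_steps: float, training_samples: int) -> float:
    """Training samples as a fraction of the estimated total decision steps."""
    return training_samples / (games * avg_steps)


# (game count, average steps per game, training samples) taken from the table
ROWS = {
    "Dou Dizhu": (200_000, 37.31, 1_000_000),
    "Guan Dan": (6_000, 311.25, 1_000_000),
    "Japanese Mahjong": (7_000, 656.92, 1_000_000),
    "UNO": (50_000, 42.33, 400_000),
    "Gin Rummy": (50_000, 52.14, 400_000),
}

for game, row in ROWS.items():
    print(f"{game}: {retained_share(*row):.1%}")
```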

Evaluation Metrics

  • Dou Dizhu: Win rate
  • Guan Dan: Round win rate
  • Other Games: Reward scores (based on ranking or RLCard framework)

Experimental Design

  1. RQ1: Single-game mastery capability assessment
  2. RQ2: Multi-game simultaneous learning capability assessment
  3. RQ3: General-purpose capability preservation assessment

Experimental Results

Main Results

RQ1: Single-Game Mastery Capability

  • Dou Dizhu: Qwen2.5-7B achieved 80.6% win rate, approaching DouZero's performance
  • Guan Dan: All three models achieved approximately 63% round win rate, approaching DanZero
  • Japanese Mahjong: Achieved performance comparable to strong AI Mortal

Model Scale Impact

  • 0.5B to 7B: Performance improves with increasing parameter count
  • 14B Model Anomaly: Performance actually declined on Dou Dizhu; analysis revealed imbalanced role learning

RQ2: Multi-Game Simultaneous Learning

API Model Comparison:

  • Among the API models, DeepSeek-R1 performed best, achieving the highest scores on 3 games
  • Fine-tuned models significantly outperformed API models on complex games (Dou Dizhu, Guan Dan, Mahjong)

Inter-Game Effects:

  • Positive Transfer: Between games with similar rules (Dou Dizhu ↔ Guan Dan, and among the three poker games)
  • Negative Interference: Conflicts between games with large rule differences

RQ3: General-Purpose Capability Preservation

Capability Decline:

  • MMLU-Pro: 47.95→44.74 (Llama3.1)
  • Math-500: 46.60→35.20 (Llama3.1)
  • HumanEval: 70.73→60.98 (Llama3.1)

Capability Recovery: Through further fine-tuning with mixed data (20k knowledge data, 20k math data, 20k programming data, and 8k game data):

  • MMLU-Pro: 44.74→45.18
  • Math-500: 35.20→47.20
  • HumanEval: 60.98→65.24
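
The figures above can be turned into a "fraction of the drop recovered" per benchmark, along with the share of game data in the mixture. This is a small arithmetic sketch over the reported numbers; the helper name recovered_fraction is introduced here for illustration.

```python
# (before game fine-tuning, after game fine-tuning, after mixed-data fine-tuning)
SCORES = {
    "MMLU-Pro": (47.95, 44.74, 45.18),
    "Math-500": (46.60, 35.20, 47.20),
    "HumanEval": (70.73, 60.98, 65.24),
}

# Mixed-data composition in thousands of samples, as listed above
MIX = {"knowledge": 20, "math": 20, "programming": 20, "game": 8}


def recovered_fraction(base: float, after_game: float, after_mix: float) -> float:
    """Share of the game-induced drop regained by mixed fine-tuning (>1 exceeds base)."""
    return (after_mix - after_game) / (base - after_game)


for name, (base, after_game, after_mix) in SCORES.items():
    print(f"{name}: {recovered_fraction(base, after_game, after_mix):.1%} of the drop recovered")

game_share = MIX["game"] / sum(MIX.values())
print(f"game data share of the mixture: {game_share:.1%}")
```

On these numbers, Math-500 ends slightly above its original score, HumanEval recovers a little under half of its drop, and MMLU-Pro recovers only a small part.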

Ablation Studies

Data Quantity Impact

Model performance on complex games kept improving as the amount of training data increased, indicating that high-quality data is crucial for LLMs to master complex games.

Model Type Comparison

  • Qwen2.5 and Llama3.1 showed similar performance on most games
  • GLM4 performed poorly on Dou Dizhu, primarily due to imbalanced role learning

Case Analysis

Dou Dizhu Role Learning

GLM4 and the 14B model excelled in the landlord role but performed markedly worse in the peasant role. Analysis revealed:

  1. Data Quality Issues: When the peasants win, data from both peasants is retained, even though the victory may be due primarily to one of them
  2. Imbalanced Learning: Models focused more on learning the landlord role
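
An imbalance of this kind only shows up when results are broken down by role rather than averaged. The sketch below shows one way to do that; the (role, won) record format is an assumption, not the paper's evaluation code.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def win_rate_by_role(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute the win rate separately for each role the model played."""
    wins: Dict[str, int] = defaultdict(int)
    games: Dict[str, int] = defaultdict(int)
    for role, won in results:
        games[role] += 1
        wins[role] += int(won)
    return {role: wins[role] / games[role] for role in games}
```

A model with a strong landlord rate and a weak peasant rate can still report a reasonable overall win rate, which is why the per-role view is useful here.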

Related Work

Game AI Development

  • Traditional Methods: From Deep Blue to the AlphaGo series, demonstrating AI breakthroughs in complex games
  • Reinforcement Learning: AlphaZero, MuZero, and others achieving superhuman levels through self-play

LLM Game Capability Research

  • Existing Research: Primarily focused on prompt-based method evaluation for games like Texas Hold'em and Blackjack
  • Limitations: Lack of in-depth assessment of LLMs' learning capabilities; insufficient game complexity

Advantages of This Work

  1. Higher Complexity: Selected games possess larger state and action spaces
  2. Learning Capability Assessment: Evaluating genuine learning capabilities through fine-tuning rather than relying solely on pre-training knowledge
  3. Systematic Research: Comprehensive evaluation across multiple games and dimensions

Conclusions and Discussion

Main Conclusions

  1. LLMs possess the capability to master complex card games: Through fine-tuning on high-quality data, they can approach the performance of specialized game AI
  2. Multi-game learning follows patterns: Positive transfer occurs between games with similar rules, while negative interference exists between games with significant rule differences
  3. General-purpose capabilities can be recovered: Although game fine-tuning damages general capabilities, this can be mitigated through mixed training

Limitations

  1. Inference Speed: LLMs require longer inference time compared to specialized game AI
  2. Data Dependency: Requires large quantities of high-quality game data
  3. Role Balance: Imbalanced learning exists in multi-role games
  4. Computational Resources: Training and inference require substantial GPU resources

Future Directions

  1. Efficiency Optimization: Investigating more efficient fine-tuning and inference methods
  2. Self-Play Learning: Exploring LLMs' self-play learning capabilities
  3. Broader Game Coverage: Extending to more types of complex games
  4. Theoretical Analysis: Deeper understanding of knowledge transfer mechanisms between games

In-Depth Evaluation

Strengths

  1. Problem Importance: Studying LLMs' capabilities in complex games holds significant theoretical and practical value
  2. Experimental Comprehensiveness: Systematic evaluation across eight games, three research questions, and multiple models
  3. Methodological Innovation: The approach of avoiding training from scratch by leveraging high-quality data from strong AI is novel
  4. Result Convincingness: Achieved performance approaching specialized AI across multiple complex games
  5. In-Depth Analysis: Thorough investigation of anomalous phenomena (e.g., poor 14B model performance)

Limitations

  1. Game Type Constraints: Limited to card games; does not cover other types of complex games
  2. Insufficient Theoretical Analysis: Lacks theoretical explanation for why LLMs can master complex games
  3. Incomplete Computational Cost Analysis: While computational resources are mentioned, detailed comparison with specialized AI is lacking
  4. Generalization Capability: Did not test performance on unseen game variants

Impact

  1. Academic Contribution: Provides important evidence for LLM applications in complex decision-making tasks
  2. Practical Value: Demonstrates the potential of LLMs as general-purpose game AI
  3. Reproducibility: Complete code and data provided, facilitating subsequent research
  4. Inspirational Significance: Provides reference for LLM applications in other complex decision-making domains

Applicable Scenarios

  1. Game AI Development: Offers new approaches for scenarios requiring rapid development of multiple game AIs
  2. Multi-Task Learning: Provides benchmarks for studying LLMs' multi-task learning capabilities
  3. Decision Systems: Provides methodological reference for developing complex decision systems
  4. AI Capability Assessment: Provides new tools for evaluating general AI systems' complex reasoning abilities

References

This paper cites 46 important references spanning multiple domains including the development history of game AI, large language model research, and reinforcement learning methods, providing a solid theoretical foundation for the research.