Can Large Language Models Master Complex Card Games?

Wang, Bie, Chen et al.

Complex games have long been an important benchmark for testing the progress of artificial intelligence algorithms. AlphaGo, AlphaZero, and MuZero have defeated top human players in Go and Chess, garnering widespread societal attention towards artificial intelligence. Concurrently, large language models (LLMs) have exhibited remarkable capabilities across various tasks, raising the question of whether LLMs can achieve similar success in complex games. In this paper, we explore the potential of LLMs in mastering complex card games. We systematically assess the learning capabilities of LLMs across eight diverse card games, evaluating the impact of fine-tuning on high-quality gameplay data, and examining the models' ability to retain general capabilities while mastering these games. Our findings indicate that: (1) LLMs can approach the performance of strong game AIs through supervised fine-tuning on high-quality data, (2) LLMs can achieve a certain level of proficiency in multiple complex card games simultaneously, with performance augmentation for games with similar rules and conflicts for dissimilar ones, and (3) LLMs experience a decline in general capabilities when mastering complex games, but this decline can be mitigated by integrating a certain amount of general instruction data. The evaluation results demonstrate strong learning ability and versatility of LLMs. The code is available at https://github.com/THUDM/LLM4CardGame

Basic Information

Paper ID: 2509.01328
Title: Can Large Language Models Master Complex Card Games?
Authors: Wei Wang, Fuqing Bie, Junzhe Chen, Dan Zhang, Shiyu Huang, Evgeny Kharlamov, Jie Tang
Classification: cs.CL
Conference: NeurIPS 2025 (39th Conference on Neural Information Processing Systems)
Paper Link: https://arxiv.org/abs/2509.01328
Code Link: https://github.com/THUDM/LLM4CardGame

Abstract

Complex games have long served as important benchmarks for testing the progress of artificial intelligence algorithms. AlphaGo, AlphaZero, and MuZero defeated top human players in Go and chess, attracting widespread societal attention to AI. Simultaneously, large language models (LLMs) have demonstrated exceptional capabilities across various tasks, raising the question of whether LLMs can achieve similar success in complex games. This paper explores the potential of LLMs to master complex card games. The research systematically evaluates the learning capabilities of LLMs across eight different card games, assesses the impact of fine-tuning on high-quality game data, and examines the model's ability to maintain general-purpose capabilities while mastering these games.

Research Background and Motivation

Problem Definition

The core research question is: Can large language models master complex card games in the same manner as specialized game AI?

Significance

Exploring AI Capability Boundaries: Complex games serve as important scenarios for testing the limits of AI algorithms, as demonstrated by systems from Deep Blue to the AlphaGo series
Assessing General Intelligence: Compared to specialized game AI, the game-mastering capabilities of LLMs as general-purpose learners hold greater research value
Multi-task Learning Ability: Evaluating whether LLMs can simultaneously master multiple complex games without requiring specially designed network architectures

Limitations of Existing Approaches

Insufficient Evaluation: Existing research primarily employs prompt-based methods without thoroughly assessing LLMs' learning capabilities
Insufficient Task Complexity: Evaluated games lack sufficient complexity to comprehensively test LLMs' learning limits
Single-Game Limitations: Lack of systematic research on LLMs' ability to simultaneously master multiple complex games

Research Motivation

Inspired by the success of the AlphaGo series, this work explores whether LLMs can master complex card games by learning high-quality game trajectory data and evaluates their advantages as general-purpose learners.

Core Contributions

First comprehensive evaluation framework for assessing LLMs' learning capabilities across multiple high-complexity games
Constructed large-scale high-quality training datasets containing eight complex card games, avoiding the high computational costs of learning from scratch
Systematic evaluation of LLMs' performance across three critical dimensions: single-game mastery, multi-game simultaneous learning, and general-purpose capability preservation
Demonstrated that LLMs possess strong learning capabilities and generality, enabling them to simultaneously master multiple complex games without modifying model architecture

Methodology Details

Task Definition

Input: Game state information (hand cards, action history, legal actions, etc.)
Output: Game action decision in JSON format
Constraints: The action must be selected from the set of legal actions
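
The constraint above can be checked mechanically. The following is a minimal sketch, assuming the model replies with a JSON object that carries a single `action` field; the field name and the choice to return `None` on failure are illustrative assumptions, not details from the paper.

```python
import json
from typing import Optional


def parse_action(model_output: str, legal_actions: list[str]) -> Optional[str]:
    """Return the chosen action if the reply is valid JSON and the action is legal."""
    try:
        payload = json.loads(model_output)
    except json.JSONDecodeError:
        return None  # reply was not valid JSON
    if not isinstance(payload, dict):
        return None  # valid JSON, but not an object
    action = payload.get("action")  # "action" is a hypothetical field name
    return action if action in legal_actions else None
```

A caller could treat a `None` result as an illegal move and, for example, fall back to a default legal action; how illegal outputs are actually handled is not stated in this summary.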

Game Selection and Data Preparation

Game Selection Criteria

Eight card games were selected based on three dimensions:

Popularity: How widely the game is played
Complexity: Measured by the number of information sets and average information set size
Data Availability: Availability of strong AI models or high-quality data

Selected Games

High-Complexity Games: Dou Dizhu, Guan Dan, Japanese Mahjong
Medium-Complexity Games: UNO, Gin Rummy
Poker-Type Games: Leduc Hold'em, Limit Texas Hold'em, No-Limit Texas Hold'em

Data Generation Pipeline

Trajectory Generation

Teacher Models: Using strong game AI (e.g., DouZero, DanZero) or expert data
Opponent Models: Rule-based models, random models, or other AI models
Game Quantity: Adjusted according to game complexity, ranging from 6k to 400k games

Data Filtering

Winner Filtering: Retaining only observation-action pairs from winning players
Selective Filtering: Retaining only samples with more than one legal action
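
The two filtering rules above can be sketched as follows. The `Step` record and its field names are a hypothetical schema chosen for illustration; the actual data format in the released code may differ.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One observation-action pair from a recorded game (hypothetical schema)."""
    player: int
    observation: str
    action: str
    legal_actions: list[str]


def filter_steps(steps: list[Step], winners: set[int]) -> list[Step]:
    """Apply winner filtering, then selective filtering."""
    kept = []
    for step in steps:
        if step.player not in winners:
            continue  # winner filtering: drop decisions made by losing players
        if len(step.legal_actions) <= 1:
            continue  # selective filtering: drop forced moves with a single legal action
        kept.append(step)
    return kept
```

Forced moves carry no decision signal, which is presumably why samples with only one legal action are dropped.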

Instruction Data Generation

Designed game-specific prompt templates containing:

Game Introduction: Rules and objectives
State Data: Hand cards, community cards, action history, legal actions
Output Format: JSON format requirements
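
A prompt with these three parts might be assembled as in the sketch below. The section labels, the wording of the format instruction, and the `action` field name are assumptions made for illustration; the paper's actual templates are game-specific and are not reproduced here.

```python
import json


def build_prompt(intro: str, state: dict, legal_actions: list[str]) -> str:
    """Assemble a game prompt from an introduction, the current state, and format rules."""
    parts = [
        "Game introduction:",
        intro,
        "Current state:",
        json.dumps(state, ensure_ascii=False),
        "Legal actions:",
        json.dumps(legal_actions, ensure_ascii=False),
        "Output format:",
        'Reply with a JSON object of the form {"action": "<one legal action>"}.',
    ]
    return "\n".join(parts)
```

Serializing the state and legal actions as JSON keeps the prompt unambiguous for games whose actions contain spaces or non-ASCII characters.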

Model Training Strategy

Model Selection

Multiple Model Types: Qwen2.5, Llama3.1, GLM4
Multiple Model Scales: 0.5B to 14B parameters

Training Configuration

Fine-tuning Method: LoRA fine-tuning (rank=8, alpha=16)
Learning Rate: Peak 1e-4 with cosine scheduling
Batch Size: 128
Training Epochs: 1 epoch
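
To make the schedule concrete, here is a minimal sketch of a cosine learning-rate curve with the peak value listed above. It assumes no warmup and decay to zero, neither of which is stated in this summary, and the 1,000k sample count in the usage note is taken from the data table only as an illustration.

```python
import math

# Values from the training configuration above.
PEAK_LR = 1e-4
LORA_RANK = 8
LORA_ALPHA = 16
BATCH_SIZE = 128

# Under the standard LoRA formulation, the low-rank update is scaled by alpha / rank.
LORA_SCALING = LORA_ALPHA / LORA_RANK


def steps_per_epoch(num_samples: int, batch_size: int = BATCH_SIZE) -> int:
    """Number of optimizer steps in one epoch, counting the final partial batch."""
    return math.ceil(num_samples / batch_size)


def cosine_lr(step: int, total_steps: int, peak: float = PEAK_LR) -> float:
    """Cosine decay from `peak` to zero (assumption: no warmup, floor of zero)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return peak * 0.5 * (1.0 + math.cos(math.pi * progress))
```

With 1,000,000 training samples and a batch size of 128, `steps_per_epoch` gives 7,813 steps for the single epoch, and `cosine_lr` falls to half of the peak at the midpoint of training.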

Experimental Setup

Data Scale

Game	Players	Teacher Model	Game Count	Avg Steps	Training Data
Dou Dizhu	3	DouZero	200k	37.31	1,000k
Guan Dan	4	DanZero	6k	311.25	1,000k
Japanese Mahjong	4	Expert Data	7k	656.92	1,000k
UNO	2	Rule Model	50k	42.33	400k
Gin Rummy	2	Rule Model	50k	52.14	400k

Evaluation Metrics

Dou Dizhu: Win rate
Guan Dan: Round win rate
Other Games: Reward scores (based on ranking or RLCard framework)

Experimental Design

RQ1: Single-game mastery capability assessment
RQ2: Multi-game simultaneous learning capability assessment
RQ3: General-purpose capability preservation assessment

Experimental Results

Main Results

RQ1: Single-Game Mastery Capability

Dou Dizhu: Qwen2.5-7B achieved an 80.6% win rate, approaching DouZero's performance
Guan Dan: All three models achieved a round win rate of approximately 63%, approaching DanZero
Japanese Mahjong: Achieved performance comparable to the strong AI Mortal

Model Scale Impact

0.5B to 7B: Performance improves with increasing parameter count
14B Model Anomaly: Performance actually declined on Dou Dizhu; analysis revealed imbalanced role learning

RQ2: Multi-Game Simultaneous Learning

API Model Comparison:

DeepSeek-R1 performed best, achieving highest scores on 3 games
Fine-tuned models significantly outperformed API models on complex games (Dou Dizhu, Guan Dan, Mahjong)

Inter-Game Effects:

Positive Transfer: Games with similar rules (Dou Dizhu ↔ Guan Dan, among three poker games)
Negative Interference: Conflicts between games with large rule differences

RQ3: General-Purpose Capability Preservation

Capability Decline:

MMLU-Pro: 47.95→44.74 (Llama3.1)
Math-500: 46.60→35.20 (Llama3.1)
HumanEval: 70.73→60.98 (Llama3.1)

Capability Recovery: Through further fine-tuning with mixed data (20k knowledge data, 20k math data, 20k programming data, and 8k game data):

MMLU-Pro: 44.74→45.18
Math-500: 35.20→47.20
HumanEval: 60.98→65.24
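
The size of the decline and the share of it that was recovered can be worked out directly from the figures above. The sketch below only restates those numbers; the helper name and the "recovered fraction" measure are illustrative choices, not metrics from the paper.

```python
# (original score, after game fine-tuning, after mixed-data fine-tuning)
SCORES = {
    "MMLU-Pro": (47.95, 44.74, 45.18),
    "Math-500": (46.60, 35.20, 47.20),
    "HumanEval": (70.73, 60.98, 65.24),
}

# Mixed-data composition in thousands of examples, as listed above.
MIX_K = {"knowledge": 20, "math": 20, "programming": 20, "game": 8}


def recovered_fraction(base: float, after_game: float, after_mix: float) -> float:
    """Share of the drop caused by game fine-tuning that mixed training wins back."""
    drop = base - after_game
    return (after_mix - after_game) / drop


total_k = sum(MIX_K.values())          # total size of the mixture, in thousands
game_share = MIX_K["game"] / total_k   # fraction of the mixture that is game data
```

On these figures, the mixture totals 68k examples with about 12% game data. Roughly 14% of the MMLU-Pro drop and 44% of the HumanEval drop are recovered, while Math-500 ends slightly above its original score.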

Ablation Studies

Data Quantity Impact

Model performance on complex games continuously improved with increasing training data, indicating that high-quality data is crucial for LLMs to master complex games.

Model Type Comparison

Qwen2.5 and Llama3.1 showed similar performance on most games
GLM4 performed poorly on Dou Dizhu, primarily due to imbalanced role learning

Case Analysis

Dou Dizhu Role Learning

GLM4 and the 14B model excelled in the landlord role but performed significantly worse in the peasant role. Analysis revealed:

Data Quality Issues: When the peasants win, the data from both peasants are retained, even though the victory may be primarily due to only one of them
Imbalanced Learning: Models focused more on learning the landlord role
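
An imbalance of this kind only becomes visible when results are broken down by role rather than reported as one overall win rate. The following is a minimal sketch of such a breakdown; the `(role, won)` record format is a hypothetical choice for illustration.

```python
from collections import defaultdict


def win_rate_by_role(games: list[tuple[str, bool]]) -> dict[str, float]:
    """Compute the win rate separately for each role the model played."""
    wins: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for role, won in games:
        totals[role] += 1
        wins[role] += int(won)
    return {role: wins[role] / totals[role] for role in totals}
```

A large gap between the landlord and peasant entries of the result would point to the kind of imbalanced role learning described above.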

Related Work

Game AI Development

Traditional Methods: From Deep Blue to the AlphaGo series, demonstrating AI breakthroughs in complex games
Reinforcement Learning: AlphaZero, MuZero, and others achieving superhuman levels through self-play

LLM Game Capability Research

Existing Research: Primarily focused on prompt-based method evaluation for games like Texas Hold'em and Blackjack
Limitations: Lack of in-depth assessment of LLMs' learning capabilities; insufficient game complexity

Advantages of This Work

Higher Complexity: Selected games possess larger state and action spaces
Learning Capability Assessment: Evaluating genuine learning capabilities through fine-tuning rather than relying solely on pre-training knowledge
Systematic Research: Comprehensive evaluation across multiple games and dimensions

Conclusions and Discussion

Main Conclusions

LLMs possess the capability to master complex card games: Through fine-tuning on high-quality data, they can approach the performance of specialized game AI
Multi-game learning follows patterns: Positive transfer occurs between games with similar rules, while negative interference exists between games with significant rule differences
General-purpose capabilities can be recovered: Although game fine-tuning damages general capabilities, this can be mitigated through mixed training

Limitations

Inference Speed: LLMs require longer inference time compared to specialized game AI
Data Dependency: Requires large quantities of high-quality game data
Role Balance: Imbalanced learning exists in multi-role games
Computational Resources: Training and inference require substantial GPU resources

Future Directions

Efficiency Optimization: Investigating more efficient fine-tuning and inference methods
Self-Play Learning: Exploring LLMs' self-play learning capabilities
Broader Game Coverage: Extending to more types of complex games
Theoretical Analysis: Deeper understanding of knowledge transfer mechanisms between games

In-Depth Evaluation

Strengths

Problem Importance: Studying LLMs' capabilities in complex games holds significant theoretical and practical value
Experimental Comprehensiveness: Systematic evaluation across eight games, three research questions, and multiple models
Methodological Innovation: The approach of avoiding training from scratch by leveraging high-quality data from strong AI is novel
Result Convincingness: Achieved performance approaching specialized AI across multiple complex games
In-Depth Analysis: Thorough investigation of anomalous phenomena (e.g., poor 14B model performance)

Limitations

Game Type Constraints: Limited to card games; does not cover other types of complex games
Insufficient Theoretical Analysis: Lacks theoretical explanation for why LLMs can master complex games
Incomplete Computational Cost Analysis: While computational resources are mentioned, detailed comparison with specialized AI is lacking
Generalization Capability: Did not test performance on unseen game variants

Impact

Academic Contribution: Provides important evidence for LLM applications in complex decision-making tasks
Practical Value: Demonstrates the potential of LLMs as general-purpose game AI
Reproducibility: Complete code and data provided, facilitating subsequent research
Inspirational Significance: Provides reference for LLM applications in other complex decision-making domains

Applicable Scenarios

Game AI Development: Offers new approaches for scenarios requiring rapid development of multiple game AIs
Multi-Task Learning: Provides benchmarks for studying LLMs' multi-task learning capabilities
Decision Systems: Provides methodological reference for developing complex decision systems
AI Capability Assessment: Provides new tools for evaluating general AI systems' complex reasoning abilities

References

This paper cites 46 important references spanning multiple domains including the development history of game AI, large language model research, and reinforcement learning methods, providing a solid theoretical foundation for the research.