2025-11-11T10:25:09.405477

Can Large Language Models Master Complex Card Games?

Wang, Bie, Chen et al.

Complex games have long been an important benchmark for testing the progress of artificial intelligence algorithms. AlphaGo, AlphaZero, and MuZero have defeated top human players in Go and Chess, garnering widespread societal attention towards artificial intelligence. Concurrently, large language models (LLMs) have exhibited remarkable capabilities across various tasks, raising the question of whether LLMs can achieve similar success in complex games. In this paper, we explore the potential of LLMs in mastering complex card games. We systematically assess the learning capabilities of LLMs across eight diverse card games, evaluating the impact of fine-tuning on high-quality gameplay data, and examining the models' ability to retain general capabilities while mastering these games. Our findings indicate that: (1) LLMs can approach the performance of strong game AIs through supervised fine-tuning on high-quality data, (2) LLMs can achieve a certain level of proficiency in multiple complex card games simultaneously, with performance augmentation for games with similar rules and conflicts for dissimilar ones, and (3) LLMs experience a decline in general capabilities when mastering complex games, but this decline can be mitigated by integrating a certain amount of general instruction data. The evaluation results demonstrate strong learning ability and versatility of LLMs. The code is available at https://github.com/THUDM/LLM4CardGame

academic

Basic information

Paper ID: 2509.01328
Title: Can Large Language Models Master Complex Card Games?
Authors: Wei Wang, Fuqing Bie, Junzhe Chen, Dan Zhang, Shiyu Huang, Evgeny Kharlamov, Jie Tang
Category: cs.CL
Venue: NeurIPS 2025 (39th Conference on Neural Information Processing Systems)
Paper link: https://arxiv.org/abs/2509.01328
Code link: https://github.com/THUDM/LLM4CardGame

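Probing the limits of AI capability: complex games are an important setting for testing the upper bound of AI algorithms, as systems from Deep Blue to the AlphaGo series have shown
Evaluating general intelligence: compared with specialized game AIs, the ability of LLMs to master games as general-purpose learners is of greater research interest
Multi-task learning ability: assessing whether LLMs can master several complex games at once without a purpose-built network architecture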

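Limitations of existing methods

Insufficient evaluation: existing studies mostly rely on prompt-based methods and do not adequately evaluate the learning ability of LLMs
Insufficient task complexity: the games evaluated are of low complexity and cannot fully test the upper bound of what LLMs can learn
Single-game focus: there is no systematic study of whether LLMs can master several complex games simultaneously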

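Research motivation

Inspired by the success of the AlphaGo series, the paper explores whether LLMs can master complex card games by learning from high-quality game trajectory data, and evaluates their advantages as general-purpose learners.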

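Core contributions

Proposes, for the first time, a comprehensive framework for evaluating the learning ability of LLMs across multiple high-complexity games
Builds a large-scale, high-quality training dataset covering eight complex card games, avoiding the high computational cost of learning from scratch
Systematically evaluates LLMs along three key dimensions: mastering a single game, learning multiple games simultaneously, and retaining general capabilities
Shows that LLMs have strong learning ability and versatility, and can master multiple complex games simultaneously without any change to the model architecture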

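Popularity: how widely the game is played
Complexity: measured by the number of information sets and the average information set size
Data availability: whether a strong AI model or high-quality data exists for the game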

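Selected games

High-complexity games: DouDizhu, GuanDan, Riichi Mahjong
Medium-complexity games: UNO, Gin Rummy
Poker games: Leduc Hold'em, Limit Texas Hold'em, No-limit Texas Hold'em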

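Data generation pipeline

Trajectory generation

Teacher models: strong game AIs (such as DouZero and DanZero) or expert data
Opponent models: rule-based models, random models, or other AI models
Number of games: adjusted to game complexity, ranging from 6k to 400k games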

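Data filtering

Winner filtering: keep only the observation-action pairs of the winning side
Selective filtering: keep only samples with more than one legal action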
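
The two filters above can be sketched as follows. This is a minimal illustration and not the authors' code: the `Step` record, its field names, and the toy trajectory are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Step:
    """One decision point in a game trajectory (hypothetical record layout)."""
    player: int               # seat index of the acting player
    observation: str          # text description of the state the player saw
    legal_actions: List[str]  # actions allowed in this state
    action: str               # action the teacher model actually took


def filter_steps(steps: List[Step], winners: Set[int]) -> List[Step]:
    """Keep steps played by a winner in which there was a real choice."""
    return [
        s for s in steps
        if s.player in winners          # winner filtering
        and len(s.legal_actions) > 1    # selective filtering
    ]


# Toy trajectory: players 0 and 1 are assumed to have won this game.
trajectory = [
    Step(0, "s0", ["pass", "play 3"], "play 3"),  # kept
    Step(1, "s1", ["pass"], "pass"),              # dropped: forced move
    Step(2, "s2", ["pass", "play 5"], "pass"),    # dropped: losing side
    Step(0, "s3", ["play 4"], "play 4"),          # dropped: forced move
]
kept = filter_steps(trajectory, winners={0, 1})
print([(s.player, s.action) for s in kept])
```

Dropping forced moves removes samples that carry no decision signal, since the model has nothing to choose between when only one action is legal.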

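Instruction data generation

A game-specific prompt template is designed for each game, containing:

Game introduction: rules and objective
State data: hand cards, public cards, action history, legal actions
Output format: the required JSON format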
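
A prompt with these three parts might be assembled as in the sketch below. The wording of the template, the state field names (`hand`, `public_cards`, `history`, `legal_actions`), and the `{"action": ...}` reply schema are all assumptions made for illustration; the paper's actual templates may differ.

```python
import json
from typing import Any, Dict, List, Optional


def build_prompt(intro: str, state: Dict[str, Any]) -> str:
    """Combine game introduction, state data, and output-format instructions."""
    state_text = json.dumps(state, ensure_ascii=False, indent=2)
    return (
        f"{intro}\n\n"
        f"Current state:\n{state_text}\n\n"
        'Reply with JSON of the form {"action": "<one of legal_actions>"}.'
    )


def parse_action(reply: str, legal_actions: List[str]) -> Optional[str]:
    """Return the chosen action, or None if the reply is malformed or illegal."""
    try:
        action = json.loads(reply)["action"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return None
    return action if action in legal_actions else None


# Hypothetical state for a small poker game.
state = {
    "hand": ["K of spades"],
    "public_cards": [],
    "history": ["opponent: raise"],
    "legal_actions": ["call", "raise", "fold"],
}
prompt = build_prompt("You are playing Leduc Hold'em. Win as many chips as you can.", state)
print(prompt)
print(parse_action('{"action": "call"}', state["legal_actions"]))
```

Asking for a fixed JSON shape makes the model's reply machine-checkable, so an illegal or unparseable answer can be detected rather than silently played.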

Model training strategy

Model selection

Model families: Qwen2.5, Llama3.1, GLM4
Model scales: 0.5B to 14B parameters

Training configuration

Fine-tuning method: LoRA fine-tuning (rank=8, alpha=16)
Learning rate: peak 1e-4, cosine schedule
Batch size: 128
Training epochs: 1 epoch
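
To make these numbers concrete, the sketch below computes what they imply under some stated assumptions. It assumes the standard LoRA scaling factor alpha/rank, a plain cosine decay from the peak to zero with no warmup (the notes do not say whether warmup is used), and a 1,000k-sample training set as listed for the largest games in the data table. It is not the authors' training script.

```python
import math

PEAK_LR = 1e-4     # peak learning rate from the configuration
BATCH_SIZE = 128   # batch size from the configuration
LORA_RANK = 8      # LoRA rank
LORA_ALPHA = 16    # LoRA alpha


def cosine_lr(step: float, total_steps: int, peak_lr: float = PEAK_LR) -> float:
    """Cosine decay from peak_lr at step 0 to 0 at total_steps (no warmup assumed)."""
    progress = step / total_steps
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


def steps_per_epoch(num_samples: int, batch_size: int = BATCH_SIZE) -> int:
    """Optimizer steps needed for one pass over the data (last batch may be partial)."""
    return math.ceil(num_samples / batch_size)


# Standard LoRA multiplies the low-rank update by alpha / rank.
lora_scale = LORA_ALPHA / LORA_RANK

# One epoch over 1,000k samples at batch size 128.
total = steps_per_epoch(1_000_000)

print(lora_scale, total)
print(cosine_lr(0, total), cosine_lr(total / 2, total), cosine_lr(total, total))
```

With one epoch and a batch size of 128, a 1,000k-sample game is covered in under eight thousand optimizer steps, so the whole schedule is a single cosine half-period.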

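Experimental setup

Data scale

Game	Players	Teacher model	Games played	Avg. steps per game	Training samples
DouDizhu	3	DouZero	200k	37.31	1,000k
GuanDan	4	DanZero	6k	311.25	1,000k
Riichi Mahjong	4	Expert data	7k	656.92	1,000k
UNO	2	Rule-based model	50k	42.33	400k
Gin Rummy	2	Rule-based model	50k	52.14	400k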