The emergence of large language models (LLMs) has opened new opportunities for creating dynamic non-player characters (NPCs) in gaming environments, enabling both functional task execution and persona-consistent dialogue generation. In this paper, we (Tu_Character_lab) report our participation in the Commonsense Persona-Grounded Dialogue Challenge (CPDC) 2025 Round 2, which evaluates agents across three tracks: task-oriented dialogue, context-aware dialogue, and their integration. Our approach combines two complementary strategies: (i) lightweight prompting techniques in the API track, including a Deflanderization prompting method to suppress excessive role-play and improve task fidelity, and (ii) fine-tuned large models in the GPU track, leveraging Qwen3-14B with supervised fine-tuning (SFT) and Low-Rank Adaptation (LoRA). Our best submissions ranked 2nd on Task 1, 2nd on Task 3 (API track), and 4th on Task 3 (GPU track).
Deflanderization for Game Dialogue: Balancing Character Authenticity with Task Execution in LLM-based NPCs
- Paper ID: 2510.13586
- Title: Deflanderization for Game Dialogue: Balancing Character Authenticity with Task Execution in LLM-based NPCs
- Authors: Pasin Buakhaw, Kun Kerdthaisong, Phuree Phenhiran, Pitikorn Khlaisamniang, Supasate Vorathammathorn, Piyalitt Ittichaiwong, Nutchanon Yongsatianchot
- Categories: cs.CL (Computation and Language), cs.AI (Artificial Intelligence)
- Publication Date: October 26, 2025
- Paper Link: https://arxiv.org/abs/2510.13586v3
The emergence of Large Language Models (LLMs) has opened new opportunities for creating dynamic non-player characters (NPCs) in game environments, enabling simultaneous achievement of functional task execution and character-consistent dialogue generation. This paper reports the participation of team TU_Character_lab in Round 2 of the Commonsense Persona-grounded Dialogue Challenge (CPDC) 2025, which evaluates agent performance across three tracks: task-oriented dialogue, context-aware dialogue, and their integration. The research methodology combines two complementary strategies: (1) lightweight prompting techniques in the API track, including deflanderization prompting methods that suppress excessive role-playing and enhance task fidelity; (2) fine-tuned large models in the GPU track, utilizing Qwen3-14B for supervised fine-tuning (SFT) and low-rank adaptation (LoRA). The best submissions ranked 2nd in Task 1, 2nd in Task 3 (API track), and 4th in Task 3 (GPU track).
Traditional game development heavily relies on pre-programmed logic, with in-game events and character interactions following preset scripts and dialogue trees. To enhance player immersion and narrative depth, developers have begun incorporating LLMs as core components of NPCs, enabling them to exhibit human-like behavior and engage in dynamic, context-aware dialogue with players.
Maintaining consistency and depth of dynamic characters over extended interactions presents significant challenges, particularly the phenomenon of "Flanderization." This term, derived from the character Ned Flanders in The Simpsons, refers to the gradual simplification of complex characters over time, ultimately becoming caricatured figures defined by a single exaggerated trait.
- Balancing Character Authenticity with Task Execution: Existing LLM-driven NPCs often neglect functional correctness when engaging in excessive role-playing
- Long-term Dialogue Consistency: Maintaining character coherence across extended conversations
- Multi-task Integration: Addressing the challenge of simultaneously handling task-oriented dialogue and character-consistent dialogue
- Proposed Deflanderization Prompting Technique: Suppresses excessive role-playing and achieves balance between dialogue generation and functional execution capabilities
- Explored Complementary Strategies of Lightweight Prompting and Fine-tuning: Prompt engineering for the API track and model fine-tuning for the GPU track
- Constructed Hybrid RAG+Memory Method: Combines retrieval-augmented generation and memory mechanisms to enhance dialogue foundations
- Achieved Excellent Results in CPDC 2025 Competition: Secured top rankings in multiple tasks, validating method effectiveness
The CPDC competition comprises three tasks:
- Task 1: Task-oriented Dialogue Agent - Evaluates correctness of function calls and accuracy of parameter selection
- Task 2: Context-aware Dialogue Agent - Evaluates NPC response consistency with specified character personas
- Task 3: Integrated Context Dialogue and Task Execution - Combines Tasks 1 and 2
The core idea is to guide the model to respond naturally and concisely, avoiding exaggerated role-playing. Error analysis revealed that baseline settings frequently produced overly detailed and contextually scattered outputs, focusing excessively on narrative setting rather than directly addressing player requests.
Primary Prompting Techniques:
- D (Deflanderization): Prompts the model to avoid excessive role-playing
- F (Few-shot): Includes two sample dialogues (merchant and guild receptionist)
- CoT (Chain of Thought): Guides the model to think step-by-step
- RW (Remove World Setting): Removes world-building information when constructing dialogue prompts
- G (Guide): Restricts responses to 1-2 short sentences using simple language
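The techniques above can be read as independent switches on one prompt builder. The following is a minimal sketch under that assumption; the instruction wording, the few-shot placeholders, and the names `PromptOptions` and `build_dialogue_prompt` are hypothetical and are not the paper's actual prompt templates. CoT is left out to keep the sketch short.

```python
from dataclasses import dataclass

# Hypothetical wording: the paper's real prompt text is not reproduced here.
DEFLANDERIZE = (
    "Stay in character, but do not exaggerate the persona. "
    "Answer the player's request directly."
)
GUIDE = "Reply in 1-2 short sentences using simple language."

# Placeholder examples standing in for the two sample dialogues.
FEW_SHOT = [
    ("merchant", "Player: Do you sell shields?\nNPC: Yes, I have two in stock."),
    ("guild receptionist", "Player: Any quests today?\nNPC: One. It pays 50 gold."),
]


@dataclass
class PromptOptions:
    deflanderize: bool = False  # D
    few_shot: bool = False      # F
    remove_world: bool = False  # RW
    guide: bool = False         # G


def build_dialogue_prompt(persona: str, world: str, player_msg: str,
                          opts: PromptOptions) -> str:
    """Assemble the dialogue prompt from the enabled techniques."""
    parts = [f"You are this NPC: {persona}"]
    if not opts.remove_world:  # RW drops the world-building block entirely
        parts.append(f"World setting: {world}")
    if opts.deflanderize:
        parts.append(DEFLANDERIZE)
    if opts.guide:
        parts.append(GUIDE)
    if opts.few_shot:
        for role, example in FEW_SHOT:
            parts.append(f"Example ({role}):\n{example}")
    parts.append(f"Player: {player_msg}\nNPC:")
    return "\n\n".join(parts)


# D+RW configuration: deflanderization with the world setting removed.
prompt = build_dialogue_prompt(
    persona="A gruff but honest blacksmith",
    world="A frontier town beside a haunted forest",
    player_msg="Can you repair my sword?",
    opts=PromptOptions(deflanderize=True, remove_world=True),
)
print(prompt)
```

Keeping each technique behind its own flag makes combinations such as D+RW or D+F easy to compare in ablations.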
As shown in Figure 2, the API track employs a five-step pipeline:
- Prepare function call prompts
- Function generation (API call #1)
- Execute functions
- Prepare dialogue prompts
- Dialogue generation (API call #2)
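The five steps can be sketched as a single turn handler that spends exactly two model calls. This is an illustrative sketch only: the tool `check_price`, the JSON call format, and the `fake_llm` stand-in are assumptions made so the example runs offline, and they do not reflect the challenge's real function set or the team's prompts.

```python
import json
from typing import Callable


# Hypothetical tool; the challenge's real functions are not shown here.
def check_price(item: str) -> dict:
    prices = {"iron sword": 120, "wooden shield": 45}
    return {"item": item, "price": prices.get(item)}


TOOLS: dict[str, Callable[..., dict]] = {"check_price": check_price}


def run_turn(player_msg: str, llm: Callable[[str], str]) -> tuple[list[dict], str]:
    # Step 1: prepare the function-call prompt.
    fn_prompt = (
        "Available functions: " + ", ".join(TOOLS) + "\n"
        f"Player: {player_msg}\n"
        'Return a JSON list of calls like [{"name": ..., "arguments": {...}}].'
    )
    # Step 2: function generation (API call #1).
    calls = json.loads(llm(fn_prompt))
    # Step 3: execute the functions, skipping names the registry does not know.
    results = [TOOLS[c["name"]](**c["arguments"]) for c in calls if c["name"] in TOOLS]
    # Step 4: prepare the dialogue prompt, including the function results.
    dlg_prompt = (
        f"Function results: {json.dumps(results)}\n"
        f"Player: {player_msg}\nNPC:"
    )
    # Step 5: dialogue generation (API call #2).
    return results, llm(dlg_prompt)


def fake_llm(prompt: str) -> str:
    """Stand-in for the chat model so the sketch runs without network access."""
    if prompt.startswith("Available functions"):
        return '[{"name": "check_price", "arguments": {"item": "iron sword"}}]'
    return "The iron sword costs 120 gold."


results, reply = run_turn("How much is the iron sword?", fake_llm)
print(results, reply)
```

Separating function generation from dialogue generation lets the second call ground its reply in the executed results rather than in guessed values.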
Given the computational constraints of the GPU track (an AWS g5e.2xlarge instance with an L40S GPU), the team considered only models that could run in this environment and chose Qwen3-14B as the primary model.
Fine-tuning Strategy:
- Full SFT: Supervised fine-tuning on initial and synthetic multi-turn dialogue data
- LoRA Fine-tuning: Low-rank adaptation on dialogue and function call datasets (rank=32, α=32)
- Retrieval Module: Uses Qwen3-Embedding-0.6B to encode player and NPC dialogue history
- Injection Phase: Injects retrieved context at two stages: function selection and dialogue drafting
- RAG+Refine: Rewrites generated drafts to match tone and length of high-similarity gold responses
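The retrieval and injection steps above can be illustrated with a small similarity search over the dialogue history. This is a sketch under stated assumptions: the bag-of-words `embed` function is a toy stand-in for Qwen3-Embedding-0.6B so the example runs without model weights, and the prompt wording and helper names are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, history: list[str], k: int = 2) -> list[str]:
    """Return the k history turns most similar to the query."""
    q = embed(query)
    ranked = sorted(history, key=lambda turn: cosine(q, embed(turn)), reverse=True)
    return ranked[:k]


history = [
    "Player: I need a sword for the forest quest",
    "NPC: The iron sword costs 120 gold",
    "Player: What is the weather like today",
]
context = retrieve("how much does the sword cost", history, k=2)

# The same retrieved context is injected at both stages.
block = "Relevant history:\n" + "\n".join(context)
function_prompt = block + "\nChoose a function."
dialogue_prompt = block + "\nDraft the NPC reply."
print(dialogue_prompt)
```

With a real embedding model, only `embed` and `cosine` would change (dense vectors instead of counts); the ranking and two-stage injection logic stay the same.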
Synthetic training data were generated with two models, gemini-2.5-pro-preview for function call data and GPT-4o-mini for dialogue data:
- Multi-turn dialogue: 2,800 data points
- Multi-turn reasoning: 2,800 data points (Task 2)
- Function call generation: 328 data points (Task 1)
- Task 1: train.json, sample.json - Function call data
- Task 2: train.json, sample.json - Character dialogue data
- Data analysis shows balanced NPC character distribution (20 merchants, 20 guild receptionists)
- Function Name Exact Match: Accuracy of predicted function names exactly matching references
- Function Parameter Exact Match: Accuracy of all predicted parameters exactly matching references
- BERTScore: Measures semantic similarity using BERT embeddings
- BLEU-4: Scoring based on modified n-gram precision
- Word-level F1: F1 score based on vocabulary sets
- CPDCscore: Weighted composite score of WordF1, BLEU, USEScore, and BERTScore
- API Track: GPT-4o-mini, maximum 2 API calls per turn, input limit 2000 tokens, output limit 200 tokens
- GPU Track: vLLM framework deployment, dtype='bfloat16', gpu_memory_utilization=0.8
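To make the exact-match and word-level F1 metrics listed above concrete, here is a minimal sketch. It assumes each function call is a dictionary with `name` and `arguments` keys and that word-level F1 is computed over whitespace-separated, lowercased token sets; the official CPDC scorer may tokenize and aggregate differently, and the CPDCscore weights are not reproduced here.

```python
def name_match(pred: dict, ref: dict) -> bool:
    """Function Name Exact Match for one call."""
    return pred.get("name") == ref.get("name")


def params_match(pred: dict, ref: dict) -> bool:
    """Function Parameter Exact Match: every argument must equal the reference."""
    return pred.get("arguments") == ref.get("arguments")


def accuracy(flags: list[bool]) -> float:
    """Share of True values; 0.0 for an empty list."""
    return sum(flags) / len(flags) if flags else 0.0


def word_f1(pred: str, ref: str) -> float:
    """F1 over the sets of words in the prediction and the reference."""
    p, r = set(pred.lower().split()), set(ref.lower().split())
    common = len(p & r)
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(r)
    return 2 * precision * recall / (precision + recall)


preds = [
    {"name": "check_price", "arguments": {"item": "iron sword"}},
    {"name": "sell", "arguments": {"item": "shield", "qty": 1}},
]
refs = [
    {"name": "check_price", "arguments": {"item": "iron sword"}},
    {"name": "sell", "arguments": {"item": "shield", "qty": 2}},
]

name_acc = accuracy([name_match(p, r) for p, r in zip(preds, refs)])
param_acc = accuracy([params_match(p, r) for p, r in zip(preds, refs)])
f1 = word_f1("the sword costs 120 gold", "the iron sword costs 120 gold")
print(name_acc, param_acc, round(f1, 3))
```

In this toy data both function names match, but the second call has a wrong `qty`, so parameter accuracy is lower than name accuracy. This is the kind of gap the two separate metrics are meant to expose.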
| Task | Method | CPDCscore |
|---|---|---|
| Task 1 | ZeroShot | 0.422 |
| Task 1 | Best Method (D+RW) | 0.586 |
| Task 3 | ZeroShot | 0.510 |
| Task 3 | Best Method | 0.601 |
Key Findings:
- Deflanderization Effect: The D strategy alone improved CPDCscore by 0.013 over the zero-shot baseline on Task 3
- Further Enhancement with Few-shot Prompting: Adding few-shot examples (F) yielded +0.092 and +0.133 improvements on Task 1
- Limited Gains from Complex Prompting: Complex strategies like CoT and guided responses showed marginal or inconsistent benefits
| Model | Method | Task 1 Score | Task 2 Score | Total Score |
|---|---|---|---|---|
| LLaMA3.1-8B | Baseline | 0.439 | 0.333 | 0.386 |
| Qwen3-14B | SFT + LoRA | 0.590 | 0.606 | 0.598 |
Key Findings:
- Model Scale and Fine-tuning are Critical: Qwen3-14B with SFT and LoRA achieved 0.598 total score, ranking 4th
- Retrieval Enhancement Provides Moderate Improvement: RAG method improved Qwen3-8B performance to 0.522
- Task Trade-offs: RAG+Refine performed best on Task 1 but degraded on Task 2, while LoRA-SFT achieved better balance
Systematic ablation experiments validated component contributions:
- Deflanderization vs. standard prompting
- Few-shot learning vs. zero-shot learning
- Comparison of different retrieval strategies
- SFT vs. LoRA vs. combined methods
- Task-oriented Systems: Kazi et al. (2024) evaluate agent planning effectiveness and goal alignment
- Game Assistants: Lee et al. (2025) developing specialized game assistants for novice players
- Multi-agent Frameworks: Phillips et al. (2025) using dialogue agents and goal verification agents
- Function Call Architecture: Multi-step frameworks including execution, perception, verification, control, and retrieval components
- Evaluation Benchmarks: τ²-Bench introduces dual-control environments for evaluating agent coordination
- User Personalization: LaMP and similar benchmarks evaluating personalized text generation
- Environment Adaptation: Role-playing in multi-agent systems like ChatDev and MetaGPT
- Lightweight Deflanderization Strategy is Effective: Significantly improves performance in API settings by suppressing excessive role-playing
- Fine-tuned Large Models Excel in GPU Track: Qwen3-14B with SFT and LoRA achieves optimal results
- Task Balance is a Key Challenge: Methods improving role-playing fidelity sometimes compromise parameter correctness
- Computational Resource Constraints: The GPU track is limited by the L40S memory budget, which restricts the use of larger models
- Retrieval Corpus Scale: RAG methods constrained by retrieval corpus size and quality
- Evaluation Metric Limitations: Automatic evaluation metrics cannot fully reflect dialogue system quality; human evaluation is needed
- Hybrid Strategy Exploration: Unified hybrid strategies combining lightweight prompting with retrieval-augmented fine-tuning
- Long-term Consistency: Methods for maintaining character consistency across longer conversations
- Multimodal Extension: Multimodal NPC systems incorporating visual and audio information
- Clear Problem Definition: The introduction of the Flanderization concept is novel and accurately describes a key problem in LLM role-playing
- Strong Method Complementarity: Different but complementary strategies for API and GPU tracks demonstrate comprehensive technical perspective
- Comprehensive Experiments: Systematic ablation studies and multi-dimensional evaluation validate method effectiveness
- High Practical Value: Excellent competition results demonstrate practical applicability
- Insufficient Theoretical Analysis: Lacks deep theoretical analysis of the Flanderization phenomenon
- Unverified Generalization: Methods primarily validated on CPDC dataset; generalization to other game scenarios unverified
- Missing Computational Efficiency Analysis: Lacks detailed analysis of computational costs and inference efficiency for different methods
- Insufficient User Experience Evaluation: Lacks subjective experience assessment from real players
- Academic Contribution: Introduces new research directions and solutions to the game AI field
- Practical Value: Methods directly applicable to NPC design in game development
- Reproducibility: Provides detailed implementation details and prompt templates facilitating reproduction
- RPG Games: Particularly suitable for role-playing games requiring rich character interactions
- Educational Games: Can be used to create intelligent tutoring assistants and virtual mentors
- Social Platforms: Extensible to chatbots on social platforms like Discord
- Kazi et al. (2024): Large language models as user-agents for evaluating task-oriented-dialogue systems
- Lee et al. (2025): AMAN: Agent for mentoring and assisting newbies in MMORPG
- Phillips et al. (2025): Goal-oriented interactions in games using llms
- Park et al. (2023): Generative agents: Interactive simulacra of human behavior
- Sony AI (2025): The commonsense persona-grounded dialogue challenge 2025
This paper presents a practical approach to LLM-based NPCs: Deflanderization prompting balances character authenticity with task execution, and the results offer a useful reference for future intelligent character design in games.