
Deflanderization for Game Dialogue: Balancing Character Authenticity with Task Execution in LLM-based NPCs

Buakhaw, Kerdthaisong, Phenhiran et al.

The emergence of large language models (LLMs) has opened new opportunities for creating dynamic non-player characters (NPCs) in gaming environments, enabling both functional task execution and persona-consistent dialogue generation. In this paper, we (Tu_Character_lab) report our participation in the Commonsense Persona-Grounded Dialogue Challenge (CPDC) 2025 Round 2, which evaluates agents across three tracks: task-oriented dialogue, context-aware dialogue, and their integration. Our approach combines two complementary strategies: (i) lightweight prompting techniques in the API track, including a Deflanderization prompting method to suppress excessive role-play and improve task fidelity, and (ii) fine-tuned large models in the GPU track, leveraging Qwen3-14B with supervised fine-tuning (SFT) and Low-Rank Adaptation (LoRA). Our best submissions ranked 2nd on Task 1, 2nd on Task 3 (API track), and 4th on Task 3 (GPU track).



Basic Information

Paper ID: 2510.13586
Title: Deflanderization for Game Dialogue: Balancing Character Authenticity with Task Execution in LLM-based NPCs
Authors: Pasin Buakhaw, Kun Kerdthaisong, Phuree Phenhiran, Pitikorn Khlaisamniang, Supasate Vorathammathorn, Piyalitt Ittichaiwong, Nutchanon Yongsatianchot
Categories: cs.CL (Computation and Language), cs.AI (Artificial Intelligence)
Publication Date: October 26, 2025
Paper Link: https://arxiv.org/abs/2510.13586v3

Abstract

The emergence of Large Language Models (LLMs) has opened new opportunities for creating dynamic non-player characters (NPCs) in game environments that combine functional task execution with character-consistent dialogue generation. This paper reports the participation of team TU_Character_lab in Round 2 of the Commonsense Persona-Grounded Dialogue Challenge (CPDC) 2025, which evaluates agents across three tracks: task-oriented dialogue, context-aware dialogue, and their integration. The approach combines two complementary strategies: (1) lightweight prompting techniques in the API track, including a Deflanderization prompting method that suppresses excessive role-playing and improves task fidelity; (2) fine-tuned large models in the GPU track, using Qwen3-14B with supervised fine-tuning (SFT) and Low-Rank Adaptation (LoRA). The best submissions ranked 2nd on Task 1, 2nd on Task 3 (API track), and 4th on Task 3 (GPU track).

Research Background and Motivation

Problem Definition

Traditional game development heavily relies on pre-programmed logic, with in-game events and character interactions following preset scripts and dialogue trees. To enhance player immersion and narrative depth, developers have begun incorporating LLMs as core components of NPCs, enabling them to exhibit human-like behavior and engage in dynamic, context-aware dialogue with players.

Core Challenges

Maintaining the consistency and depth of dynamic characters over extended interactions is difficult, particularly because of the phenomenon of "Flanderization." The term, derived from the character Ned Flanders in The Simpsons, refers to the gradual simplification of a complex character over time until the character becomes a caricature defined by a single exaggerated trait.

Research Motivation

Balancing Character Authenticity with Task Execution: Existing LLM-driven NPCs often neglect functional correctness when engaging in excessive role-playing
Long-term Dialogue Consistency: Maintaining character coherence across extended conversations
Multi-task Integration: Addressing the challenge of simultaneously handling task-oriented dialogue and character-consistent dialogue

Core Contributions

Proposed Deflanderization Prompting Technique: Suppresses excessive role-playing and achieves balance between dialogue generation and functional execution capabilities
Explored Complementary Strategies of Lightweight Prompting and Fine-tuning: Prompt engineering for the API track and model fine-tuning for the GPU track
Constructed Hybrid RAG+Memory Method: Combines retrieval-augmented generation and memory mechanisms to enhance dialogue foundations
Achieved Top Rankings in the CPDC 2025 Competition: Ranked 2nd on Task 1 and 2nd on Task 3 in the API track, and 4th on Task 3 in the GPU track

Methodology Details

Task Definition

The CPDC competition comprises three tasks:

Task 1: Task-oriented Dialogue Agent - Evaluates correctness of function calls and accuracy of parameter selection
Task 2: Context-aware Dialogue Agent - Evaluates NPC response consistency with specified character personas
Task 3: Integrated Context Dialogue and Task Execution - Combines Tasks 1 and 2

API Track Methodology

Deflanderization Prompting Strategy

The core idea is to guide the model to respond naturally and concisely, avoiding exaggerated role-playing. Error analysis revealed that baseline settings frequently produced overly detailed and contextually scattered outputs, focusing excessively on narrative setting rather than directly addressing player requests.

Primary Prompting Techniques:

D (Deflanderization): Prompts the model to avoid excessive role-playing
F (Few-shot): Includes two sample dialogues (merchant and guild receptionist)
CoT (Chain of Thought): Guides the model to think step-by-step
RW (Remove World Setting): Removes world-building information when constructing dialogue prompts
G (Guide): Restricts responses to 1-2 short sentences using simple language
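
The techniques listed above can be read as independent switches on a prompt builder. The following is a minimal sketch under that reading; the instruction strings, the PromptOptions class, and build_dialogue_prompt are hypothetical illustrations, not the prompt templates used in the paper.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical instruction wording; the paper's exact prompts are not reproduced here.
DEFLANDERIZATION = (
    "Stay in character, but do not exaggerate the persona. "
    "Answer the player's request directly."
)
GUIDE = "Reply in 1-2 short sentences using simple language."
COT = "Think step by step before answering."


@dataclass
class PromptOptions:
    deflanderize: bool = False      # D
    few_shot: bool = False          # F
    chain_of_thought: bool = False  # CoT
    remove_world: bool = False      # RW
    guide: bool = False             # G


def build_dialogue_prompt(
    persona: str,
    world_setting: str,
    examples: List[Dict[str, str]],
    player_message: str,
    opts: PromptOptions,
) -> str:
    """Assemble a dialogue prompt from the enabled techniques."""
    parts = [f"Persona: {persona}"]
    if not opts.remove_world:          # RW drops the world-building text
        parts.append(f"World setting: {world_setting}")
    if opts.deflanderize:
        parts.append(DEFLANDERIZATION)
    if opts.guide:
        parts.append(GUIDE)
    if opts.chain_of_thought:
        parts.append(COT)
    if opts.few_shot:                  # F adds sample dialogues
        for ex in examples:
            parts.append(f"Example:\nPlayer: {ex['player']}\nNPC: {ex['npc']}")
    parts.append(f"Player: {player_message}\nNPC:")
    return "\n\n".join(parts)


# D+RW configuration, the combination reported as best on Task 1.
prompt = build_dialogue_prompt(
    persona="a gruff but fair weapon merchant",
    world_setting="A kingdom recovering from a long war.",
    examples=[],
    player_message="How much is the iron sword?",
    opts=PromptOptions(deflanderize=True, remove_world=True),
)
```

Keeping each technique behind its own flag makes it straightforward to run the kind of per-technique comparison the results section describes.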

Pipeline Design

As shown in Figure 2, the API track employs a five-step pipeline:

Prepare function call prompts
Function generation (API call #1)
Execute functions
Prepare dialogue prompts
Dialogue generation (API call #2)
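
The five steps above can be sketched as a single function that makes exactly two model calls per turn. This is an illustrative sketch only: run_turn, the JSON call format, and the prompt wording are assumptions, and llm stands for any callable that maps a prompt string to a completion string.

```python
import json
from typing import Any, Callable, Dict


def run_turn(
    llm: Callable[[str], str],
    tools: Dict[str, Callable[..., Any]],
    persona: str,
    player_message: str,
) -> Dict[str, Any]:
    # Step 1: prepare the function-call prompt (hypothetical wording).
    function_prompt = (
        "Pick at most one function for the player's request and answer with JSON "
        'of the form {"name": <string or null>, "arguments": {...}}.\n'
        f"Available functions: {sorted(tools)}\n"
        f"Player: {player_message}"
    )

    # Step 2: function generation (API call #1).
    raw = llm(function_prompt)

    # Step 3: execute the selected function, ignoring malformed output.
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        call = {}
    if not isinstance(call, dict):
        call = {}
    name = call.get("name")
    args = call.get("arguments") or {}
    executed, result = None, None
    if isinstance(name, str) and name in tools and isinstance(args, dict):
        result = tools[name](**args)
        executed = name

    # Step 4: prepare the dialogue prompt, including the function result.
    dialogue_prompt = (
        f"You are {persona}.\n"
        f"Function result: {json.dumps(result, default=str)}\n"
        f"Player: {player_message}\nNPC:"
    )

    # Step 5: dialogue generation (API call #2).
    reply = llm(dialogue_prompt)
    return {"function": executed, "result": result, "reply": reply}
```

Separating function generation from dialogue generation lets the second prompt ground the NPC's reply in the actual function result rather than in the model's guess.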

GPU Track Methodology

Model Selection and Fine-tuning

Given the computational constraints (an AWS g5e.2xlarge instance with an NVIDIA L40S GPU), only models that could run in this environment were considered, and Qwen3-14B was ultimately chosen as the primary model.

Fine-tuning Strategy:

Full SFT: Supervised fine-tuning on initial and synthetic multi-turn dialogue data
LoRA Fine-tuning: Low-rank adaptation on dialogue and function call datasets (rank=32, α=32)
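
To make the reported LoRA setting (rank=32, α=32) concrete, the short calculation below uses the standard LoRA formulation, in which the low-rank update is scaled by α/rank and a weight matrix of shape d_out × d_in gains rank × (d_in + d_out) trainable parameters. The layer width of 5120 is an illustrative assumption, not a figure taken from the paper.

```python
def lora_scaling(rank: int, alpha: float) -> float:
    """Scaling factor applied to the low-rank update B @ A in standard LoRA."""
    return alpha / rank


def lora_extra_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added to one weight matrix: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


RANK, ALPHA = 32, 32   # values reported for the GPU track
HIDDEN = 5120          # assumed square projection width, for illustration only

scale = lora_scaling(RANK, ALPHA)                     # 1.0
extra = lora_extra_params(HIDDEN, HIDDEN, RANK)       # 327680
fraction = extra / (HIDDEN * HIDDEN)                  # 0.0125

print(f"scaling={scale}, extra params per matrix={extra}, fraction of full matrix={fraction:.4f}")
```

With α equal to the rank, the update is applied at unit scale, and each adapted matrix of this size trains about 1.25% of the parameters that full fine-tuning of that matrix would.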

Hybrid RAG+Memory Method

Retrieval Module: Uses Qwen3-Embedding-0.6B to encode player and NPC dialogue history
Injection Phase: Injects retrieved context at two stages—function selection and dialogue drafting
RAG+Refine: Rewrites generated drafts to match tone and length of high-similarity gold responses
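
The retrieval step can be sketched as nearest-neighbour search over embedded dialogue turns. In the sketch below a bag-of-words vector stands in for Qwen3-Embedding-0.6B so that the example runs without model weights; toy_embed, cosine, and retrieve are hypothetical names, and the real system's scoring and injection format may differ.

```python
import math
from collections import Counter
from typing import List


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: str, memory: List[str], k: int = 2) -> List[str]:
    """Return the k stored turns most similar to the query."""
    q = toy_embed(query)
    ranked = sorted(memory, key=lambda turn: cosine(q, toy_embed(turn)), reverse=True)
    return ranked[:k]


memory = [
    "Player: do you sell healing potions",
    "NPC: the guild opens at dawn",
    "Player: how much is the iron sword",
]
top = retrieve("how much for a sword", memory, k=1)

# The retrieved turns would then be injected into both the function-selection
# prompt and the dialogue-drafting prompt.
context_block = "Relevant history:\n" + "\n".join(top)
```

Swapping toy_embed for a dense embedding model changes only the vector representation; the ranking and injection logic stay the same.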

Data Augmentation

Synthetic function call data were generated with gemini-2.5-pro-preview and synthetic dialogue data with GPT-4o-mini:

Multi-turn dialogue: 2,800 data points
Multi-turn reasoning: 2,800 data points (Task 2)
Function call generation: 328 data points (Task 1)

Experimental Setup

Datasets

Task 1: train.json, sample.json - Function call data
Task 2: train.json, sample.json - Character dialogue data
Data analysis shows balanced NPC character distribution (20 merchants, 20 guild receptionists)

Evaluation Metrics

Task 1 Metrics

Function Name Exact Match: Accuracy of predicted function names exactly matching references
Function Parameter Exact Match: Accuracy of all predicted parameters exactly matching references
BERTScore: Measures semantic similarity using BERT embeddings
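
The two exact-match metrics can be sketched as follows. The dictionary layout with "name" and "arguments" keys is an assumption made for illustration, and the official scorer may differ in details such as how multiple calls per turn are aligned.

```python
from typing import Any, Dict, List

Call = Dict[str, Any]


def name_exact_match(pred: Call, ref: Call) -> bool:
    """True when the predicted function name equals the reference name."""
    return pred.get("name") == ref.get("name")


def params_exact_match(pred: Call, ref: Call) -> bool:
    """True when every predicted argument equals the reference, with no extras or omissions."""
    return pred.get("arguments") == ref.get("arguments")


def accuracy(preds: List[Call], refs: List[Call], match) -> float:
    """Fraction of prediction/reference pairs for which match() holds."""
    if not refs:
        return 0.0
    return sum(match(p, r) for p, r in zip(preds, refs)) / len(refs)


refs = [
    {"name": "check_price", "arguments": {"item": "iron sword"}},
    {"name": "sell_item", "arguments": {"item": "potion", "quantity": 2}},
    {"name": "check_stock", "arguments": {"item": "shield"}},
]
preds = [
    {"name": "check_price", "arguments": {"item": "iron sword"}},   # fully correct
    {"name": "sell_item", "arguments": {"item": "potion", "quantity": 1}},  # wrong quantity
    {"name": "check_price", "arguments": {"item": "bow"}},          # wrong name and argument
]

name_acc = accuracy(preds, refs, name_exact_match)
param_acc = accuracy(preds, refs, params_exact_match)
```

The second prediction illustrates why the two metrics are reported separately: the function is right but one argument is not, so it counts toward name accuracy only.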

Task 2 Metrics

BLEU-4: Scoring based on modified n-gram precision
Word-level F1: F1 score based on vocabulary sets
CPDCscore: Weighted composite score of WordF1, BLEU, USEScore, and BERTScore
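
As an illustration of the word-level F1 described above, the sketch below computes F1 over the sets of words in a prediction and a reference. Whitespace tokenization and lowercasing are assumptions; the official implementation may tokenize differently, and the CPDCscore weights are not reproduced here.

```python
def word_f1(prediction: str, reference: str) -> float:
    """F1 over the sets of lowercased, whitespace-separated words."""
    pred_words = set(prediction.lower().split())
    ref_words = set(reference.lower().split())
    overlap = len(pred_words & ref_words)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_words)
    recall = overlap / len(ref_words)
    return 2 * precision * recall / (precision + recall)


# Four of five words are shared in each direction, so precision = recall = 0.8.
score = word_f1("the sword costs fifty gold", "the sword is fifty gold")
```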

Implementation Details

API Track: GPT-4o-mini, maximum 2 API calls per turn, input limit 2000 tokens, output limit 200 tokens
GPU Track: vLLM framework deployment, dtype='bfloat16', gpu_memory_utilization=0.8
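
The API-track limits (at most 2 calls per turn and a 2000-token input budget) can be enforced with a thin wrapper around the model call. This is a hypothetical sketch: TurnBudget is not part of any library, and the whitespace token count is a crude stand-in for the tokenizer the competition actually uses. The 200-token output limit would normally be passed to the API as a maximum-output parameter and is not modeled here.

```python
from typing import Callable


class TurnBudget:
    """Wraps an LLM callable and enforces per-turn call and input-size limits."""

    def __init__(
        self,
        llm: Callable[[str], str],
        max_calls: int = 2,
        max_input_tokens: int = 2000,
        count_tokens: Callable[[str], int] = lambda s: len(s.split()),
    ) -> None:
        self.llm = llm
        self.max_calls = max_calls
        self.max_input_tokens = max_input_tokens
        self.count_tokens = count_tokens
        self.calls = 0

    def __call__(self, prompt: str) -> str:
        if self.calls >= self.max_calls:
            raise RuntimeError("per-turn API call limit exceeded")
        if self.count_tokens(prompt) > self.max_input_tokens:
            raise ValueError("prompt exceeds the input token limit")
        self.calls += 1
        return self.llm(prompt)

    def reset(self) -> None:
        """Start a new turn."""
        self.calls = 0
```

A wrapper like this makes a violation fail loudly during local testing rather than silently at submission time.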

Experimental Results

API Track Main Results

Task	Method	CPDCscore
Task 1	ZeroShot	0.422
Task 1	Best Method (D+RW)	0.586
Task 3	ZeroShot	0.510
Task 3	Best Method	0.601

Key Findings:

Significant Deflanderization Effect: D strategy achieved +0.013 CPDCscore improvement over zero-shot baseline on Task 3
Further Enhancement with Few-shot Prompting: Adding few-shot examples (F) yielded +0.092 and +0.133 improvements on Task 1
Limited Gains from Complex Prompting: Complex strategies like CoT and guided responses showed marginal or inconsistent benefits

GPU Track Main Results

Model	Method	Task 1 Score	Task 2 Score	Total Score
LLaMA3.1-8B	Baseline	0.439	0.333	0.386
Qwen3-14B	SFT + LoRA	0.590	0.606	0.598
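
The Total Score values in this table are numerically consistent with a simple mean of the Task 1 and Task 2 scores. This is an inference from the reported numbers, not a definition stated in the summary, and the check below only verifies that arithmetic.

```python
# (Task 1, Task 2, reported Total) for each row of the table above.
rows = {
    "LLaMA3.1-8B Baseline": (0.439, 0.333, 0.386),
    "Qwen3-14B SFT + LoRA": (0.590, 0.606, 0.598),
}


def mean_score(task1: float, task2: float) -> float:
    """Unweighted mean of the two task scores (assumed aggregation)."""
    return (task1 + task2) / 2


# True for a row when the mean matches the reported total to three decimals.
consistent = {
    name: abs(mean_score(t1, t2) - total) < 0.0005
    for name, (t1, t2, total) in rows.items()
}
```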

Key Findings:

Model Scale and Fine-tuning are Critical: Qwen3-14B with SFT and LoRA achieved a total score of 0.598, ranking 4th on Task 3 (GPU track)
Retrieval Enhancement Provides Moderate Improvement: RAG method improved Qwen3-8B performance to 0.522
Task Trade-offs: RAG+Refine performed best on Task 1 but degraded on Task 2, while LoRA-SFT achieved better balance

Ablation Studies

Systematic ablation experiments validated component contributions:

Deflanderization vs. standard prompting
Few-shot learning vs. zero-shot learning
Comparison of different retrieval strategies
SFT vs. LoRA vs. combined methods

Related Work

Game-oriented Dialogue Agents

Task-oriented Systems: Kazi et al. (2024) evaluate agent planning effectiveness and goal alignment
Game Assistants: Lee et al. (2025) develop specialized game assistants for novice players
Multi-agent Frameworks: Phillips et al. (2025) use dialogue agents together with goal verification agents

Tool Calling Capabilities

Function Call Architecture: Multi-step frameworks including execution, perception, verification, control, and retrieval components
Evaluation Benchmarks: τ2-Bench introducing dual-control environments for evaluating agent coordination

Role-playing LLMs

User Personalization: LaMP and similar benchmarks evaluating personalized text generation
Environment Adaptation: Role-playing in multi-agent systems like ChatDev and MetaGPT

Conclusions and Discussion

Main Conclusions

Lightweight Deflanderization Strategy is Effective: Significantly improves performance in API settings by suppressing excessive role-playing
Fine-tuned Large Models Excel in GPU Track: Qwen3-14B with SFT and LoRA achieves optimal results
Task Balance is a Key Challenge: Methods improving role-playing fidelity sometimes compromise parameter correctness

Limitations

Computational Resource Constraints: The GPU track was limited by the memory budget of the L40S GPU, which restricted the use of larger models
Retrieval Corpus Scale: RAG methods constrained by retrieval corpus size and quality
Evaluation Metric Limitations: Automatic evaluation metrics cannot fully reflect dialogue system quality; human evaluation is needed

Future Directions

Hybrid Strategy Exploration: Unified hybrid strategies combining lightweight prompting with retrieval-augmented fine-tuning
Long-term Consistency: Methods for maintaining character consistency across longer conversations
Multimodal Extension: Multimodal NPC systems incorporating visual and audio information

In-depth Evaluation

Strengths

Clear Problem Definition: The introduction of the Flanderization concept is novel and accurately describes a key problem in LLM role-playing
Strong Method Complementarity: Different but complementary strategies for API and GPU tracks demonstrate comprehensive technical perspective
Comprehensive Experiments: Systematic ablation studies and multi-dimensional evaluation validate method effectiveness
High Practical Value: Excellent competition results demonstrate practical applicability

Weaknesses

Insufficient Theoretical Analysis: Lacks deep theoretical analysis of the Flanderization phenomenon
Unverified Generalization: Methods primarily validated on CPDC dataset; generalization to other game scenarios unverified
Missing Computational Efficiency Analysis: Lacks detailed analysis of computational costs and inference efficiency for different methods
Insufficient User Experience Evaluation: Lacks subjective experience assessment from real players

Impact

Academic Contribution: Introduces new research directions and solutions to the game AI field
Practical Value: Methods directly applicable to NPC design in game development
Reproducibility: Provides detailed implementation details and prompt templates facilitating reproduction

Applicable Scenarios

RPG Games: Particularly suitable for role-playing games requiring rich character interactions
Educational Games: Can be used to create intelligent tutoring assistants and virtual mentors
Social Platforms: Extensible to chatbots on social platforms like Discord

References

Kazi et al. (2024): Large language models as user-agents for evaluating task-oriented-dialogue systems
Lee et al. (2025): AMAN: Agent for mentoring and assisting newbies in MMORPG
Phillips et al. (2025): Goal-oriented interactions in games using LLMs
Park et al. (2023): Generative agents: Interactive simulacra of human behavior
Sony AI (2025): The commonsense persona-grounded dialogue challenge 2025

Overall, the paper offers a practical approach to balancing NPC character authenticity with task execution: Deflanderization prompting in the API track and a fine-tuned Qwen3-14B in the GPU track. Together these provide a useful reference for the design of LLM-based game characters.