Teaching LLM to be Persuasive: Reward-Enhanced Policy Optimization for Alignment from Heterogeneous Rewards
Zhuang, Chen, Zeng et al.
We study deploying large language models (LLMs) as business development (BD) agents for persuasive price negotiation in online travel agencies (OTAs), where aligning traveler affordability and hotel profitability directly affects bookings, partner relationships, and access to travel. The agent must follow a Standard Operating Procedure (SOP) while conducting multi-turn persuasion, interpreting colloquial inputs, and adhering to guardrails (no over-promising, no hallucinations). Conventional post-training -- supervised fine-tuning (SFT) or single-source reward optimization -- overfits scripts, misses nuanced persuasive style, and fails to enforce verifiable business constraints.
We propose Reward-Enhanced Policy Optimization (REPO), a reinforcement learning post-training framework that aligns an LLM with heterogeneous rewards: a preference-trained reward model (RM) for dense human alignment, a reward judge (RJ) for high-level persuasive behavior and SOP compliance, and programmatic reward functions (RF) for deterministic checks on numerics, formatting, and guardrails. A straightforward enhancement mechanism combines the RM with RJ and RF signals to curb reward hacking and improve negotiation quality. In production-style evaluations -- approximately 150 turns from real dialogues and 225 turns from curated bad-case dialogues -- REPO lifts the average dialogue rating to 4.63 (+1.20 over the base model, +0.83 over Direct Preference Optimization (DPO), and +0.33 over Group Relative Policy Optimization (GRPO)), increases the share of conversations with at least one excellent response to 66.67% (+23.34 percentage points over GRPO), and achieves a 93.33% bad-case fix rate with 75.56% clean fixes, outperforming SFT, DPO, Proximal Policy Optimization (PPO), and GRPO. We also observe emergent capabilities -- proactive empathy, localized reasoning, calibrated tactics -- that surpass gold annotations.
This study explores deploying large language models (LLMs) as business development (BD) agents for online travel agencies (OTAs) in persuasive price negotiation. The agent must conduct multi-turn persuasion following standard operating procedures (SOPs) while balancing traveler affordability and hotel profitability, understanding colloquial inputs, and adhering to guardrails. Traditional post-training methods (such as supervised fine-tuning or single-reward optimization) suffer from script overfitting, lack of nuanced persuasion styles, and inability to enforce verifiable business constraints.
The authors propose Reward-Enhanced Policy Optimization (REPO), a reinforcement learning post-training framework that aligns LLMs using heterogeneous rewards: preference-trained reward models (RM) for dense human alignment, reward judges (RJ) for advanced persuasive behaviors and SOP compliance, and programmatic reward functions (RF) for deterministic checks on numerics, formatting, and guardrails. In production-style evaluation, REPO raises the average dialogue rating to 4.63 and reaches a 93.33% bad-case fix rate, outperforming SFT, DPO, PPO, and GRPO.
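To make the idea of combining the three reward families concrete, here is a minimal Python sketch. The summary does not specify the paper's actual enhancement mechanism, so the additive bonus/penalty rule, the `RewardSignals` fields, and the `rj_bonus` and `rf_penalty` weights below are all illustrative assumptions, not the authors' formulation.

```python
from dataclasses import dataclass


@dataclass
class RewardSignals:
    """Hypothetical container for the three signal families of one response."""
    rm_score: float  # dense reward-model score (assumed to lie roughly in [0, 1])
    rj_pass: bool    # reward-judge verdict on persuasive behavior / SOP compliance
    rf_pass: bool    # programmatic checks: numerics, formatting, guardrails


def combined_reward(s: RewardSignals,
                    rj_bonus: float = 0.5,
                    rf_penalty: float = 1.0) -> float:
    """Assumed combination rule: start from the RM score, add a bonus when the
    judge approves, and subtract a penalty when a deterministic check fails."""
    reward = s.rm_score
    if s.rj_pass:
        reward += rj_bonus
    if not s.rf_pass:
        reward -= rf_penalty
    return reward


# A response the RM likes but that violates a guardrail...
hacked = combined_reward(RewardSignals(rm_score=0.9, rj_pass=True, rf_pass=False))
# ...ends up scoring below a plainer response that passes the checks.
clean = combined_reward(RewardSignals(rm_score=0.5, rj_pass=False, rf_pass=True))
```

Under these assumed weights, a response cannot earn a high total reward from the RM alone while failing a verifiable check, which is one simple way the extra signals could curb reward hacking.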
Price negotiation in online travel agencies is a complex business scenario requiring BD agents to conduct multi-turn dialogues with hotel managers, aiming to reduce room prices to improve traveler affordability while maintaining hotel profitability. Such negotiations directly impact room booking volumes, partnerships, and overall travel costs.
Negotiation Complexity: Requires nuanced, context-aware reasoning and persuasive interaction, such as calibrated concessions, competitor comparisons, and empathetic framing.
Procedural Flow Adherence: Must infer the current state and take appropriate actions within multi-stage processes according to SOPs.
Verifiable Numerics and Guardrails: Outputs must satisfy strict business constraints, such as accurate pricing, valid formatting, and no false commitments.
Persuasive and Adaptive Response Generation: Must handle diverse scenarios, including edge cases and adversarial situations.
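The "verifiable numerics and guardrails" requirement is the kind of constraint a programmatic reward function can check deterministically. The sketch below is a hypothetical example of such a check: the banned phrases, the dollar-price regex, and the `floor_price` / `list_price` parameters are assumptions chosen for illustration and do not come from the paper.

```python
import re

# Hypothetical markers of over-promising; a real deployment would define its own.
BANNED_PHRASES = ("guarantee", "we promise")


def extract_prices(text: str) -> list[float]:
    """Pull dollar amounts such as "$120" or "$99.50" out of a response."""
    return [float(m) for m in re.findall(r"\$(\d+(?:\.\d{1,2})?)", text)]


def rf_check(response: str, floor_price: float, list_price: float) -> bool:
    """Return True only if the response avoids banned phrases and every quoted
    price lies within [floor_price, list_price]. A response that quotes no
    price passes the numeric part of the check."""
    lowered = response.lower()
    if any(phrase in lowered for phrase in BANNED_PHRASES):
        return False
    return all(floor_price <= p <= list_price for p in extract_prices(response))


ok = rf_check("Could you consider $120 per night?", floor_price=100, list_price=150)
over_promise = rf_check("We guarantee more bookings at $120.", 100, 150)
out_of_range = rf_check("How about $80 per night?", 100, 150)
```

Because the check is pure string and number logic, its verdict is reproducible and cannot be swayed by fluent wording, which is what distinguishes it from a learned reward model or an LLM judge.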
The OTA price negotiation task requires BD agents to conduct multi-turn dialogues with hotels, adjusting room prices based on market conditions. The goal is balancing traveler affordability and hotel profitability to ensure win-win outcomes.
Methods are evaluated on four binary skills: dialogue fluency, workflow compliance, negotiation effectiveness, and scope understanding. REPO leads by a wide margin on negotiation effectiveness, which is the primary differentiator among methods.
Existing methods either rely on single signal sources or combine only partial reward types. REPO is the first method jointly leveraging all three signal families.
REPO enables proactive price negotiation through carefully designed multi-source rewards. In human expert evaluation, REPO consistently outperforms all baseline methods on average dialogue rating, the share of conversations with at least one excellent response, and the bad-case fix rate.
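The abstract reports REPO's scores together with its margins over several baselines, so the baselines' own scores can be recovered by subtraction. The short Python snippet below does that arithmetic; the implied values are derived here from the reported figures and are not quoted from the paper, and the variable names are illustrative.

```python
# Figures as reported in the abstract.
repo_rating = 4.63
rating_gain_over = {"base": 1.20, "DPO": 0.83, "GRPO": 0.33}

# Implied average dialogue rating of each baseline (derived, not reported).
implied_rating = {
    name: round(repo_rating - gain, 2) for name, gain in rating_gain_over.items()
}

# Implied GRPO share of conversations with at least one excellent response:
# 66.67% for REPO minus the reported 23.34 percentage-point gain.
implied_grpo_excellent_share = round(66.67 - 23.34, 2)

print(implied_rating)
print(implied_grpo_excellent_share)
```

Read this way, the step from GRPO to REPO is the smallest of the three rating gains, while the gain in the excellent-response share over GRPO is comparatively large.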
The paper cites important works in reinforcement learning, dialogue systems, and controllable text generation, including:
Ouyang et al., 2022 (RLHF)
Rafailov et al., 2024 (DPO)
Shao et al., 2024 (GRPO)
Zheng et al., 2023 (LLM-as-a-judge)
Overall Assessment: This is a high-quality applied research paper that proposes valuable technical innovations while addressing practical business problems. The REPO framework design is sound, experimental evaluation is thorough, and demonstrated emergent capabilities are impressive. While there is room for improvement in generalization verification and theoretical analysis, its contributions to complex task-oriented dialogue are significant.