
Teaching LLM to be Persuasive: Reward-Enhanced Policy Optimization for Alignment from Heterogeneous Rewards

Zhuang, Chen, Zeng et al.

We study deploying large language models (LLMs) as business development (BD) agents for persuasive price negotiation in online travel agencies (OTAs), where aligning traveler affordability and hotel profitability directly affects bookings, partner relationships, and access to travel. The agent must follow a Standard Operating Procedure (SOP) while conducting multi-turn persuasion, interpreting colloquial inputs, and adhering to guardrails (no over-promising, no hallucinations). Conventional post-training -- supervised fine-tuning (SFT) or single-source reward optimization -- overfits scripts, misses nuanced persuasive style, and fails to enforce verifiable business constraints. We propose Reward-Enhanced Policy Optimization (REPO), a reinforcement learning post-training framework that aligns an LLM with heterogeneous rewards: a preference-trained reward model (RM) for dense human alignment, a reward judge (RJ) for high-level persuasive behavior and SOP compliance, and programmatic reward functions (RF) for deterministic checks on numerics, formatting, and guardrails. A straightforward enhancement mechanism is proposed to combine the RM with RJ and RF signals to curb reward hacking and improve negotiation quality. In production-style evaluations -- approximately 150 turns from real dialogues and 225 turns from curated bad-case dialogues -- REPO lifts the average dialogue rating to 4.63 (+1.20 over base, +0.83 over Direct Preference Optimization (DPO), and +0.33 over Group Relative Policy Optimization (GRPO)), increases the share of conversations with at least one excellent response to 66.67% (+23.34 percentage points over GRPO), and achieves a 93.33% bad-case fix rate with 75.56% clean fixes, outperforming SFT, DPO, PPO, and GRPO. We also observe emergent capabilities -- proactive empathy, localized reasoning, calibrated tactics -- that surpass gold annotations.


Teaching LLM to be Persuasive: Reward-Enhanced Policy Optimization for Alignment from Heterogeneous Rewards

Basic Information

Paper ID: 2510.04214
Title: Teaching LLM to be Persuasive: Reward-Enhanced Policy Optimization for Alignment from Heterogeneous Rewards
Authors: Zhuoran Zhuang, Ye Chen, Xia Zeng*, Chao Luo, Luhui Liu and Yihan Chen (Fliggy Alibaba)
Classification: cs.CL
Publication Date: October 11, 2025 (arXiv v2)
Paper Link: https://arxiv.org/abs/2510.04214v2

Abstract

This study explores deploying large language models (LLMs) as business development (BD) agents for online travel agencies (OTAs) in persuasive price negotiation. The agent must conduct multi-turn persuasion following standard operating procedures (SOPs) while balancing traveler affordability and hotel profitability, understanding colloquial inputs, and adhering to guardrails. Traditional post-training methods (such as supervised fine-tuning or single-reward optimization) suffer from script overfitting, lack of nuanced persuasion styles, and inability to enforce verifiable business constraints.

The authors propose Reward-Enhanced Policy Optimization (REPO), a reinforcement learning post-training framework that aligns LLMs using heterogeneous rewards: preference-trained reward models (RM) for dense human alignment, reward judges (RJ) for advanced persuasive behaviors and SOP compliance, and programmatic reward functions (RF) for deterministic checks on numerics, formatting, and guardrails. In production-level evaluation, REPO significantly improves dialogue quality and issue resolution rates.

Research Background and Motivation

Problem Definition

Price negotiation in online travel agencies is a complex business scenario requiring BD agents to conduct multi-turn dialogues with hotel managers, aiming to reduce room prices to improve traveler affordability while maintaining hotel profitability. Such negotiations directly impact room booking volumes, partnerships, and overall travel costs.

Challenge Analysis

Negotiation Complexity: Requires nuanced, context-aware reasoning and persuasive interactions, including calibrated concessions, competitor comparisons, empathetic framing, etc.
Procedural Flow Adherence: Must infer current state and take appropriate actions within multi-stage processes according to SOPs
Verifiable Numerics and Guardrails: Outputs must satisfy strict business constraints, such as accurate pricing, valid formatting, avoiding false commitments
Persuasive and Adaptive Response Generation: Must handle diverse scenarios, including edge cases and adversarial situations

Limitations of Existing Methods

Supervised Fine-Tuning (SFT): Prone to overfitting training data with limited generalization
Direct Preference Optimization (DPO): Dependent on preference data quality, lacking mechanisms to enforce structured business rules
Traditional Reinforcement Learning (PPO/GRPO): Unstable training dynamics, susceptible to "reward hacking"

Core Contributions

First LLM Study on Industrial-Grade Price Negotiation Task: Addresses complex, long-horizon persuasion scenarios beyond traditional QA tasks
Proposes REPO Framework: The first alignment framework for complex task-oriented dialogue that aggregates preference, judge, and programmatic rewards
Comprehensive Evaluation: Demonstrates REPO's superiority in negotiation effectiveness, compliance, and emergent persuasive capabilities, surpassing human-annotated gold standards

Methodology Details

Task Definition

The OTA price negotiation task requires BD agents to conduct multi-turn dialogues with hotels, adjusting room prices based on market conditions. The goal is balancing traveler affordability and hotel profitability to ensure win-win outcomes.

REPO Architecture

Three-Source Reward Design

Reward Model (RM): Model trained on preference data providing dense human alignment signals, learning human BD persuasion styles and strategies
Reward Judge (RJ): LLM-as-a-judge framework evaluating high-level behaviors such as SOP compliance, emotional value, and persuasion style
Programmatic Reward Function (RF): Deterministic checks on business numerics, formatting, guardrails, and length requirements
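
To make the RF idea concrete, here is a minimal sketch of a deterministic check of the kind described above. The specific checks, the forbidden-phrase list, the price pattern, the word budget, and the function name are all hypothetical illustrations; the summary does not specify the actual rules used in the paper.

```python
import re

# Hypothetical over-promising markers; the real guardrail list is not given in the source.
FORBIDDEN_PHRASES = ("guarantee", "100% sure")


def programmatic_reward(response: str, allowed_prices: set[int], max_words: int = 120) -> float:
    """Return a score in [-1, 1] from three deterministic checks (illustrative only)."""
    checks = []

    # Numerics: every dollar amount quoted in the response must be an allowed price.
    quoted = {int(m) for m in re.findall(r"\$(\d+)", response)}
    checks.append(quoted <= allowed_prices)

    # Guardrail: the response must not contain an over-promising phrase.
    lowered = response.lower()
    checks.append(not any(phrase in lowered for phrase in FORBIDDEN_PHRASES))

    # Length: the response must stay within the word budget.
    checks.append(len(response.split()) <= max_words)

    # Map the fraction of passed checks from [0, 1] to [-1, 1].
    return 2.0 * sum(checks) / len(checks) - 1.0
```

Because each check is a pure function of the response text, the same response always receives the same score, which is what distinguishes this signal from the learned RM and the LLM-based RJ.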

Reward Enhancement Mechanism

REPO employs a stability-preserving modulation strategy, using RJ and RF as auxiliary signals to scale the primary RM signal:

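$$E_{\text{enh}} = \operatorname{clip}\left(E_{\text{judge}} + E_{\text{func}},\, -n,\, n\right)$$
$$R_{\text{total}} = R_{\text{model}} \left(1 \pm \frac{E_{\text{enh}}}{n}\right)$$

Here $E_{\text{judge}}$ and $E_{\text{func}}$ are the RJ and RF signals and $n$ bounds the enhancement term. The sign in $\pm$ follows the sign of $R_{\text{model}}$, which makes the scaling sign-aware and magnitude-sensitive:

When $R_{\text{model}} > 0$ and $E_{\text{enh}} > 0$, rewards are amplified
When $R_{\text{model}} > 0$ and $E_{\text{enh}} < 0$, rewards are suppressed
When $R_{\text{model}} < 0$, penalties are reduced (if $E_{\text{enh}} > 0$) or amplified (if $E_{\text{enh}} < 0$)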
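
The modulation can be sketched in a few lines of Python. This is a minimal illustration, assuming the ± resolves to + when the RM reward is non-negative and to − when it is negative; that reading is an assumption chosen to match the three cases listed above, and the function name and default bound are hypothetical.

```python
def enhanced_reward(r_model: float, e_judge: float, e_func: float, n: float = 1.0) -> float:
    """Scale the RM reward by the clipped sum of the judge and function signals."""
    if n <= 0:
        raise ValueError("n must be positive")

    # Clip the combined auxiliary signal to [-n, n].
    e_enh = max(-n, min(n, e_judge + e_func))

    # Assumed sign-aware branch: a positive e_enh should always help the response,
    # so it enlarges a positive reward and shrinks a negative one.
    sign = 1.0 if r_model >= 0 else -1.0
    return r_model * (1.0 + sign * e_enh / n)


# The four sign combinations with n = 1:
print(enhanced_reward(2.0, 0.5, 0.0))    # positive reward amplified
print(enhanced_reward(2.0, -0.5, 0.0))   # positive reward suppressed
print(enhanced_reward(-2.0, 0.5, 0.0))   # penalty reduced
print(enhanced_reward(-2.0, -0.5, 0.0))  # penalty amplified
```

Under this reading the scale factor always lies in [0, 2], so the auxiliary signals can at most double the RM reward or cancel it, and they never flip its sign.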

Efficient Computational Optimization

LoRA Adapters: Low-rank adaptation on policy and value networks reducing memory and accelerating training
Reference-Free Model: No KL penalty; LoRA's low-rank constraints support stable updates
Group-Free Computation: Avoids group-based scoring and aggregation, computing rewards per trajectory
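
To illustrate what "group-free" avoids, the sketch below places a GRPO-style group-normalized advantage next to a per-trajectory advantage with a value baseline. The second function is a deliberate simplification and an assumption: the summary only states that rewards are computed per trajectory and that a value network is trained, not how the advantage is formed. The function names and the use of the population standard deviation are illustrative choices.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """GRPO-style: normalize each reward against the other samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def per_trajectory_advantage(reward: float, value_estimate: float) -> float:
    """Group-free (assumed form): compare one trajectory's reward with a value baseline."""
    return reward - value_estimate


# Group-based scoring needs several sampled responses per prompt.
print(group_relative_advantages([1.0, 2.0, 3.0]))

# Per-trajectory scoring needs one response and a value estimate.
print(per_trajectory_advantage(1.5, 1.0))
```

The practical difference is sampling cost: the group-based form requires multiple rollouts per prompt before any advantage can be computed, while the per-trajectory form does not.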

Experimental Setup

Models and Parameters

Base Model: Qwen3-32B-Instruct
Maximum Response Length: 512 tokens
Batch Size: 128
LoRA Configuration: rank=64, alpha=64
Learning Rate: 1e-6
Training Epochs: Supervised stage (SFT/DPO) 10 epochs, RL stage (PPO/GRPO/REPO) 2 epochs
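
For reference, the reported settings can be collected into a single configuration object. This is only a convenience sketch: the class and field names are hypothetical and do not correspond to any particular training library. The derived `lora_scaling` uses the standard LoRA convention of scaling the low-rank update by alpha divided by rank.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepoTrainingConfig:
    """Hyperparameters as listed in the summary (field names are illustrative)."""
    base_model: str = "Qwen3-32B-Instruct"
    max_response_tokens: int = 512
    batch_size: int = 128
    lora_rank: int = 64
    lora_alpha: int = 64
    learning_rate: float = 1e-6
    supervised_epochs: int = 10  # SFT / DPO stage
    rl_epochs: int = 2           # PPO / GRPO / REPO stage

    @property
    def lora_scaling(self) -> float:
        # Standard LoRA scaling factor applied to the low-rank update.
        return self.lora_alpha / self.lora_rank


config = RepoTrainingConfig()
print(config.lora_scaling)
```

With rank and alpha both set to 64, the scaling factor is 1, so the low-rank update is applied without extra amplification or damping.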

Training Data

The authors constructed a high-quality preference dataset of 6,632 samples:

252 cases from online production
3,178 samples annotated by language experts
1,211 samples annotated by task experts (human BDs)
1,991 preference samples enriched by human BDs after initial SFT annotation
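
A quick arithmetic check confirms that the four sources add up to the stated total and shows each source's share. The dictionary keys are shorthand labels introduced here, not names from the paper.

```python
sources = {
    "online production cases": 252,
    "language-expert annotations": 3178,
    "task-expert (human BD) annotations": 1211,
    "BD-enriched preference samples": 1991,
}

total = sum(sources.values())
print(total)  # 6632

# Share of each source as a percentage of the whole dataset.
for name, count in sources.items():
    print(f"{name}: {100 * count / total:.1f}%")
```

Expert-annotated and BD-enriched samples make up the bulk of the data, while raw production cases are a small fraction.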

Evaluation Data

Online Samples: 30 complete production dialogues (~150 turns) reflecting real distribution of hotel intentions
Problem Case Collection: 45 dialogues (~225 turns) curated by business experts, covering various issues where base models fail

Comparison Methods

SFT: Supervised Fine-Tuning
DPO: Direct Preference Optimization
PPO: Proximal Policy Optimization
GRPO: Group Relative Policy Optimization

Experimental Results

Main Results

Online Sample Evaluation

Evaluated using two metrics:

Overall Dialogue Score (1-5 scale): REPO achieves 4.63, +1.20 over baseline, +0.83 over DPO, +0.33 over GRPO
Excellent Response Dialogue Ratio: REPO reaches 66.67%, 5× baseline (13.33%), ~2× DPO (33.33%), +23.34 percentage points over GRPO
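
The reported deltas imply the scores of the compared methods, which can be recovered with simple arithmetic. The sketch below assumes the excellent-response ratio is computed over the 30 online dialogues described under Evaluation Data; the dialogue counts are derived here and are not stated in the summary.

```python
repo_score = 4.63
gains_over = {"base": 1.20, "DPO": 0.83, "GRPO": 0.33}

# Implied average dialogue scores for the compared methods.
implied_scores = {name: round(repo_score - gain, 2) for name, gain in gains_over.items()}
print(implied_scores)

n_dialogues = 30  # assumption: ratio is taken over the 30 online dialogues
excellent_pct = {"REPO": 66.67, "DPO": 33.33, "base": 13.33}
excellent_pct["GRPO"] = round(excellent_pct["REPO"] - 23.34, 2)

# Implied number of dialogues containing at least one excellent response.
excellent_counts = {name: round(n_dialogues * pct / 100) for name, pct in excellent_pct.items()}
print(excellent_counts)
```

Each percentage maps to a whole number of dialogues out of 30, which is consistent with the assumed denominator.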

Problem Case Resolution

Overall Resolution Rate: REPO, DPO, and SFT all achieve 93.33%
Clean Resolution Rate: REPO highest (75.56%), significantly outperforming other methods
Severe Unresolved Cases: REPO at 0%, best performance
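
The same kind of check applies to the problem-case rates. Assuming they are computed over the 45 curated dialogues, the percentages correspond to whole-dialogue counts; the counts below are derived here rather than reported.

```python
n_cases = 45  # assumption: rates are taken over the 45 curated dialogues
repo_rates_pct = {"resolved": 93.33, "cleanly resolved": 75.56}

# Implied number of dialogues behind each reported rate.
repo_case_counts = {name: round(n_cases * pct / 100) for name, pct in repo_rates_pct.items()}
print(repo_case_counts)

# Round-trip: the implied counts reproduce the reported percentages.
for name, count in repo_case_counts.items():
    print(name, round(100 * count / n_cases, 2))
```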

Ablation Studies

Emergent Negotiation Capability Analysis

Tracking persuasion ability scores during REPO training reveals three phases:

Initial Phase (0-30 steps): Unstable exploration
Learning Phase (30-100 steps): Steady policy improvement
Convergence Phase (100-190 steps): Performance stabilization

The final checkpoint shows a ~30% improvement in persuasion ability score over early checkpoints.

Fine-Grained Dialogue Skill Assessment

Evaluated on four binary skills: dialogue fluency, workflow compliance, negotiation effectiveness, scope understanding. REPO leads significantly on negotiation effectiveness, the primary differentiator among methods.

Case Analysis

The paper demonstrates emergent capabilities after REPO training:

Emotional Value + Root Cause Reasoning: Provides richer context-aware reasoning than gold standard
Hotel-Type-Targeted Pitching: Tailors the pitch to the hotel type and combines it with competitor-aware reasoning
Persuasion Under Information Constraints: Reframes requests using exposure and conversion logic

Related Work

Task-Oriented Dialogue Systems and LLM Alignment

Existing research primarily focuses on passive, user-initiated tasks. Proactive price negotiation requires long-horizon persuasion strategies combining context-based reasoning and calibrated emotional intelligence.

Controllable Text Generation and Multi-Reward Aggregation

Existing methods either rely on single signal sources or combine only partial reward types. REPO is the first method jointly leveraging all three signal families.

Conclusions and Discussion

Main Conclusions

REPO successfully enables proactive price negotiation through carefully designed multi-source rewards. In human expert evaluation, REPO consistently outperforms all baseline methods in dialogue quality, excellent response occurrence rate, and problem case resolution.

Limitations

Limited Evaluation Scope: Evaluated only on price negotiation task, requiring validation across broader tasks and settings
Computational Resource Requirements: Requires substantial computational resources for training
Domain Specificity: Method designed for specific business scenarios

Future Directions

Extension to smaller model backbones
Application to broader domains and languages
Improved reward design

In-Depth Evaluation

Strengths

High Practical Value: Addresses complex problems in real business scenarios
Strong Methodological Innovation: First systematic combination of three heterogeneous reward signals
Comprehensive Evaluation: Includes production-level data and multi-dimensional evaluation metrics
Reasonable Technical Implementation: Achieves efficient training through techniques like LoRA
Significant Emergent Capabilities: Demonstrates persuasive abilities surpassing human annotation

Weaknesses

Insufficient Generalization Verification: Validated only on single task, lacking cross-domain evaluation
Limited Theoretical Analysis: Lacks theoretical guarantees for reward combination mechanisms
Insufficient Computational Cost Analysis: Lacks detailed analysis of computational overhead compared to baselines
Unknown Long-Term Effects: Lacks analysis of long-term deployment effectiveness

Impact

Academic Contribution: Provides new insights for LLM alignment in complex task-oriented dialogue
Industrial Value: Direct application to real business scenarios with strong practicality
Methodological Inspiration: Heterogeneous reward integration approach generalizable to other complex tasks

Applicable Scenarios

Customer Service and Sales Dialogue Systems: Scenarios requiring persuasion and negotiation capabilities
Multi-Constraint Optimization Tasks: Generation tasks requiring simultaneous satisfaction of multiple constraint types
Business Process Automation: Automated systems requiring adherence to complex SOPs

References

The paper cites important works in reinforcement learning, dialogue systems, and controllable text generation, including:

Ouyang et al., 2022 (RLHF)
Rafailov et al., 2024 (DPO)
Shao et al., 2024 (GRPO)
Zheng et al., 2023 (LLM-as-a-judge)

Overall Assessment: This is a high-quality applied research paper that proposes valuable technical innovations while addressing practical business problems. The REPO framework design is sound, experimental evaluation is thorough, and demonstrated emergent capabilities are impressive. While there is room for improvement in generalization verification and theoretical analysis, its contributions to complex task-oriented dialogue are significant.