Teaching LLM to be Persuasive: Reward-Enhanced Policy Optimization for Alignment from Heterogeneous Rewards

Zhuang, Chen, Zeng et al.
We study deploying large language models (LLMs) as business development (BD) agents for persuasive price negotiation in online travel agencies (OTAs), where aligning traveler affordability and hotel profitability directly affects bookings, partner relationships, and access to travel. The agent must follow a Standard Operating Procedure (SOP) while conducting multi-turn persuasion, interpreting colloquial inputs, and adhering to guardrails (no over-promising, no hallucinations). Conventional post-training -- supervised fine-tuning (SFT) or single-source reward optimization -- overfits scripts, misses nuanced persuasive style, and fails to enforce verifiable business constraints. We propose Reward-Enhanced Policy Optimization (REPO), a reinforcement learning post-training framework that aligns an LLM with heterogeneous rewards: a preference-trained reward model (RM) for dense human alignment, a reward judge (RJ) for high-level persuasive behavior and SOP compliance, and programmatic reward functions (RF) for deterministic checks on numerics, formatting, and guardrails. A straightforward enhancement mechanism is proposed to combine the RM with RJ and RF signals to curb reward hacking and improve negotiation quality. In production-style evaluations -- approximately 150 turns from real dialogues and 225 turns from curated bad-case dialogues -- REPO lifts average dialogue rating to 4.63: +1.20 over base, +0.83 over Direct Preference Optimization (DPO); +0.33 over Group Relative Policy Optimization (GRPO), increases the share of conversations with at least one excellent response to 66.67% (+23.34 percentage points over GRPO), and achieves a 93.33% bad-case fix rate with 75.56% clean fixes, outperforming SFT, DPO, PPO, and GRPO. We also observe emergent capabilities -- proactive empathy, localized reasoning, calibrated tactics -- that surpass gold annotations.

Teaching LLM to be Persuasive: Reward-Enhanced Policy Optimization for Alignment from Heterogeneous Rewards

Basic Information

  • Paper ID: 2510.04214
  • Title: Teaching LLM to be Persuasive: Reward-Enhanced Policy Optimization for Alignment from Heterogeneous Rewards
  • Authors: Zhuoran Zhuang, Ye Chen, Xia Zeng*, Chao Luo, Luhui Liu and Yihan Chen (Fliggy Alibaba)
  • Classification: cs.CL
  • Publication Date: October 11, 2025 (arXiv v2)
  • Paper Link: https://arxiv.org/abs/2510.04214v2

Abstract

This study explores deploying large language models (LLMs) as business development (BD) agents for online travel agencies (OTAs) in persuasive price negotiation. The agent must conduct multi-turn persuasion following standard operating procedures (SOPs) while balancing traveler affordability and hotel profitability, understanding colloquial inputs, and adhering to guardrails. Traditional post-training methods (such as supervised fine-tuning or single-reward optimization) suffer from script overfitting, lack of nuanced persuasion styles, and inability to enforce verifiable business constraints.

The authors propose Reward-Enhanced Policy Optimization (REPO), a reinforcement learning post-training framework that aligns LLMs using heterogeneous rewards: preference-trained reward models (RM) for dense human alignment, reward judges (RJ) for advanced persuasive behaviors and SOP compliance, and programmatic reward functions (RF) for deterministic checks on numerics, formatting, and guardrails. In production-style evaluation, REPO raises the average dialogue rating to 4.63 and reaches a 93.33% bad-case fix rate, with 75.56% clean fixes.

Research Background and Motivation

Problem Definition

Price negotiation in online travel agencies is a complex business scenario requiring BD agents to conduct multi-turn dialogues with hotel managers, aiming to reduce room prices to improve traveler affordability while maintaining hotel profitability. Such negotiations directly impact room booking volumes, partnerships, and overall travel costs.

Challenge Analysis

  1. Negotiation Complexity: Requires nuanced, context-aware reasoning and persuasive interactions, including calibrated concessions, competitor comparisons, empathetic framing, etc.
  2. Procedural Flow Adherence: Must infer current state and take appropriate actions within multi-stage processes according to SOPs
  3. Verifiable Numerics and Guardrails: Outputs must satisfy strict business constraints, such as accurate pricing, valid formatting, avoiding false commitments
  4. Persuasive and Adaptive Response Generation: Must handle diverse scenarios, including edge cases and adversarial situations

Limitations of Existing Methods

  • Supervised Fine-Tuning (SFT): Prone to overfitting training data with limited generalization
  • Direct Preference Optimization (DPO): Dependent on preference data quality, lacking mechanisms to enforce structured business rules
  • Traditional Reinforcement Learning (PPO/GRPO): Unstable training dynamics, susceptible to "reward hacking"

Core Contributions

  1. First LLM Study on Industrial-Grade Price Negotiation Task: Addresses complex, long-horizon persuasion scenarios beyond traditional QA tasks
  2. Proposes REPO Framework: First complex task-oriented dialogue alignment framework aggregating preference, judgment, and programmatic rewards
  3. Comprehensive Evaluation Verification: Demonstrates REPO's superiority in negotiation effectiveness, compliance, and emergent persuasive capabilities, surpassing human-annotated gold standards

Methodology Details

Task Definition

The OTA price negotiation task requires BD agents to conduct multi-turn dialogues with hotels, adjusting room prices based on market conditions. The goal is balancing traveler affordability and hotel profitability to ensure win-win outcomes.

REPO Architecture

Three-Source Reward Design

  1. Reward Model (RM): Model trained on preference data providing dense human alignment signals, learning human BD persuasion styles and strategies
  2. Reward Judge (RJ): LLM-as-a-judge framework evaluating high-level behaviors such as SOP compliance, emotional value, and persuasion style
  3. Programmatic Reward Function (RF): Deterministic checks on business numerics, formatting, guardrails, and length requirements
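
As a rough illustration of what a programmatic reward function (RF) could look like, the sketch below scores a response with three deterministic checks. The function name, the banned phrases, the dollar-price pattern, the character budget, and the +1/-1 scoring are all assumptions made for illustration; the paper's actual rules are not reproduced in this summary.

```python
import re

# Hypothetical guardrail phrases; the real rule set is not given in this summary.
BANNED_PHRASES = ("guarantee", "we promise")


def programmatic_reward(response: str, quoted_price: float, max_chars: int = 600) -> int:
    """Sum of +1/-1 votes from deterministic checks (illustrative scoring only)."""
    score = 0
    # Length check: reward responses that stay within the budget.
    score += 1 if len(response) <= max_chars else -1
    # Guardrail check: penalize over-promising language.
    lowered = response.lower()
    score += -1 if any(phrase in lowered for phrase in BANNED_PHRASES) else 1
    # Numeric check: every dollar amount mentioned must match the quoted price.
    prices = [float(m) for m in re.findall(r"\$(\d+(?:\.\d+)?)", response)]
    score += 1 if all(abs(p - quoted_price) < 1e-9 for p in prices) else -1
    return score
```

Because each check is deterministic, the same response always receives the same score, which is the property that distinguishes RF from the learned RM and the LLM-based RJ.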

Reward Enhancement Mechanism

REPO employs a stability-preserving modulation strategy, using RJ and RF as auxiliary signals to scale the primary RM signal:

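E_{enh} = \mathrm{clip}(E_{judge} + E_{func}, -n, n)
R_{total} = R_{model} \cdot (1 \pm E_{enh} / n)

Here E_{judge} and E_{func} are the RJ and RF signals, n is the clipping bound, and the sign of the \pm term follows the sign of R_{model}. This sign-aware, magnitude-sensitive scaling achieves:

  • When R_{model} > 0 and E_{enh} > 0, the reward is amplified
  • When R_{model} > 0 and E_{enh} < 0, the reward is suppressed
  • When R_{model} < 0, the penalty is reduced if E_{enh} > 0 and amplified if E_{enh} < 0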
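
The scaling rule above can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: the function and argument names are hypothetical, and the "±" is read as "+" for non-negative RM rewards and "-" for negative ones, an assumption inferred from the sign-aware cases listed above.

```python
def enhanced_reward(r_model: float, e_judge: float, e_func: float, n: float) -> float:
    """Scale the RM reward by the clipped sum of the judge and function signals."""
    if n <= 0:
        raise ValueError("n must be positive")
    # Clip the combined auxiliary signal to [-n, n].
    e_enh = max(-n, min(n, e_judge + e_func))
    # Assumed reading of the "±": follow the sign of the RM reward.
    sign = 1.0 if r_model >= 0 else -1.0
    return r_model * (1.0 + sign * e_enh / n)


# A positive reward with positive auxiliary signals is amplified,
# while a negative reward with the same signals has its penalty reduced.
print(enhanced_reward(1.0, 1.0, 1.0, 4.0))   # 1.5
print(enhanced_reward(-1.0, 1.0, 1.0, 4.0))  # -0.5
```

Since the clipped term lies in [-n, n], the multiplier stays in [0, 2], so the auxiliary signals can shrink or grow the RM reward but never flip its sign.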

Efficient Computational Optimization

  1. LoRA Adapters: Low-rank adaptation on policy and value networks reducing memory and accelerating training
  2. Reference-Free Model: No KL penalty; LoRA's low-rank constraints support stable updates
  3. Group-Free Computation: Avoids group-based scoring and aggregation, computing rewards per trajectory
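
To make the group-free point concrete, the sketch below contrasts a GRPO-style group-relative advantage with a per-trajectory advantage computed against a critic's value estimate. Both functions are simplified illustrations with hypothetical names: the group version uses the population standard deviation, and the per-trajectory version omits discounting and GAE, so neither should be read as the paper's exact estimator.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """GRPO-style: normalize each reward against the group it was sampled with."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def per_trajectory_advantage(reward: float, value_estimate: float) -> float:
    """Group-free: compare one trajectory's reward with a critic baseline."""
    return reward - value_estimate
```

The group-relative form needs several sampled responses per prompt before any advantage can be computed, whereas the per-trajectory form needs only one response plus a value network, which is consistent with LoRA adapters being applied to both the policy and the value network.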

Experimental Setup

Models and Parameters

  • Base Model: Qwen3-32B-Instruct
  • Maximum Response Length: 512 tokens
  • Batch Size: 128
  • LoRA Configuration: rank=64, alpha=64
  • Learning Rate: 1e-6
  • Training Epochs: Supervised stage (SFT/DPO) 10 epochs, RL stage (PPO/GRPO/REPO) 2 epochs
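
For reference, the reported settings can be collected in a small configuration object. The class and field names are hypothetical; only the values come from the list above. The `lora_scaling` property uses the standard LoRA convention of scaling the low-rank update by alpha / rank, which the summary does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    base_model: str = "Qwen3-32B-Instruct"
    max_response_tokens: int = 512
    batch_size: int = 128
    lora_rank: int = 64
    lora_alpha: int = 64
    learning_rate: float = 1e-6
    supervised_epochs: int = 10  # SFT / DPO stage
    rl_epochs: int = 2           # PPO / GRPO / REPO stage

    @property
    def lora_scaling(self) -> float:
        # Standard LoRA convention: the update is scaled by alpha / rank.
        return self.lora_alpha / self.lora_rank
```

With rank and alpha both set to 64, the scaling factor is 1.0, so the low-rank update is applied without additional rescaling.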

Training Data

The authors constructed a high-quality preference dataset of 6,632 samples:

  • 252 cases from online production
  • 3,178 samples annotated by language experts
  • 1,211 samples annotated by task experts (human BDs)
  • 1,991 preference samples enriched by human BDs after initial SFT annotation
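
As a quick consistency check, the four sources do add up to the stated total. The snippet below also computes each source's share of the dataset; the dictionary keys are shorthand labels, not names from the paper.

```python
# Counts as listed in the breakdown above.
sources = {
    "online production cases": 252,
    "language-expert annotations": 3178,
    "task-expert (human BD) annotations": 1211,
    "BD-enriched preference samples": 1991,
}
total = sum(sources.values())
shares = {name: count / total for name, count in sources.items()}

for name, count in sources.items():
    print(f"{name}: {count} ({shares[name]:.1%})")
print(f"total: {total}")
```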

Evaluation Data

  1. Online Samples: 30 complete production dialogues (~150 turns) reflecting real distribution of hotel intentions
  2. Problem Case Collection: 45 dialogues (~225 turns) curated by business experts, covering various issues where base models fail

Comparison Methods

  • SFT: Supervised Fine-Tuning
  • DPO: Direct Preference Optimization
  • PPO: Proximal Policy Optimization
  • GRPO: Group Relative Policy Optimization

Experimental Results

Main Results

Online Sample Evaluation

Evaluated using two metrics:

  1. Overall Dialogue Score (1-5 scale): REPO achieves 4.63, +1.20 over baseline, +0.83 over DPO, +0.33 over GRPO
  2. Excellent Response Dialogue Ratio: REPO reaches 66.67%, 5× baseline (13.33%), ~2× DPO (33.33%), +23.34 percentage points over GRPO

Problem Case Resolution

  • Overall Resolution Rate: REPO, DPO, and SFT all achieve 93.33%
  • Clean Resolution Rate: REPO highest (75.56%), significantly outperforming other methods
  • Severe Unresolved Cases: REPO at 0%, best performance
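
The reported deltas and percentages can be turned back into implied absolute values. The sketch below assumes the percentages are taken over the 30 online dialogues and the 45 problem-case dialogues described under Evaluation Data; the derived baseline scores and dialogue counts are back-calculations from the figures above, not numbers quoted from the paper.

```python
REPO_SCORE = 4.63
score_gains = {"base": 1.20, "DPO": 0.83, "GRPO": 0.33}
# Implied average rating for each baseline.
implied_scores = {m: round(REPO_SCORE - g, 2) for m, g in score_gains.items()}


def implied_count(percent: float, n_dialogues: int) -> int:
    """Nearest whole number of dialogues matching a reported percentage."""
    return round(n_dialogues * percent / 100)


# Share of online dialogues with at least one excellent response.
excellent = {"REPO": 66.67, "base": 13.33, "DPO": 33.33, "GRPO": round(66.67 - 23.34, 2)}
excellent_counts = {m: implied_count(p, 30) for m, p in excellent.items()}

# Problem-case results for REPO.
resolved = implied_count(93.33, 45)
clean = implied_count(75.56, 45)

print(implied_scores)
print(excellent_counts)
print(resolved, clean)
```

Under these assumptions, the percentages correspond to whole numbers of dialogues (for example, 20 of 30 for REPO's excellent-response ratio and 42 of 45 for the overall resolution rate), which is a useful reminder of how small the evaluation sets are.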

Ablation Studies

Emergent Negotiation Capability Analysis

By tracking persuasion ability scores during training, REPO demonstrates three phases:

  1. Initial Phase (0-30 steps): Unstable exploration
  2. Learning Phase (30-100 steps): Steady policy improvement
  3. Convergence Phase (100-190 steps): Performance stabilization

The final checkpoint shows a ~30% improvement in persuasion ability score over early checkpoints.

Fine-Grained Dialogue Skill Assessment

Evaluated on four binary skills: dialogue fluency, workflow compliance, negotiation effectiveness, scope understanding. REPO leads significantly on negotiation effectiveness, the primary differentiator among methods.

Case Analysis

The paper demonstrates emergent capabilities after REPO training:

  1. Emotional Value + Root Cause Reasoning: Provides richer context-aware reasoning than gold standard
  2. Hotel-Type-Targeted Pitching: Tailors the pitch to the hotel type and combines it with competitor-aware reasoning
  3. Persuasion Under Information Constraints: Reframes requests using exposure and conversion logic

Related Work

Task-Oriented Dialogue Systems and LLM Alignment

Existing research primarily focuses on passive, user-initiated tasks. Proactive price negotiation requires long-horizon persuasion strategies combining context-based reasoning and calibrated emotional intelligence.

Controllable Text Generation and Multi-Reward Aggregation

Existing methods either rely on a single signal source or combine only some of the reward types. The authors present REPO as the first method to jointly leverage all three signal families: preference, judgment, and programmatic rewards.

Conclusions and Discussion

Main Conclusions

REPO enables proactive price negotiation through carefully designed multi-source rewards. In human expert evaluation, REPO outperforms all baseline methods in dialogue quality and excellent-response rate, and achieves the highest clean resolution rate on problem cases while matching DPO and SFT on overall resolution rate.

Limitations

  1. Limited Evaluation Scope: Evaluated only on price negotiation task, requiring validation across broader tasks and settings
  2. Computational Resource Requirements: Requires substantial computational resources for training
  3. Domain Specificity: Method designed for specific business scenarios

Future Directions

  1. Extension to smaller model backbones
  2. Application to broader domains and languages
  3. Improved reward design

In-Depth Evaluation

Strengths

  1. High Practical Value: Addresses complex problems in real business scenarios
  2. Strong Methodological Innovation: First systematic combination of three heterogeneous reward signals
  3. Comprehensive Evaluation: Includes production-level data and multi-dimensional evaluation metrics
  4. Reasonable Technical Implementation: Achieves efficient training through techniques like LoRA
  5. Significant Emergent Capabilities: Demonstrates persuasive abilities surpassing human annotation

Weaknesses

  1. Insufficient Generalization Verification: Validated only on single task, lacking cross-domain evaluation
  2. Limited Theoretical Analysis: Lacks theoretical guarantees for reward combination mechanisms
  3. Insufficient Computational Cost Analysis: Lacks detailed analysis of computational overhead compared to baselines
  4. Unknown Long-Term Effects: Lacks analysis of long-term deployment effectiveness

Impact

  1. Academic Contribution: Provides new insights for LLM alignment in complex task-oriented dialogue
  2. Industrial Value: Direct application to real business scenarios with strong practicality
  3. Methodological Inspiration: Heterogeneous reward integration approach generalizable to other complex tasks

Applicable Scenarios

  1. Customer Service and Sales Dialogue Systems: Scenarios requiring persuasion and negotiation capabilities
  2. Multi-Constraint Optimization Tasks: Generation tasks requiring simultaneous satisfaction of multiple constraint types
  3. Business Process Automation: Automated systems requiring adherence to complex SOPs

References

The paper cites important works in reinforcement learning, dialogue systems, and controllable text generation, including:

  • Ouyang et al., 2022 (RLHF)
  • Rafailov et al., 2024 (DPO)
  • Shao et al., 2024 (GRPO)
  • Zheng et al., 2023 (LLM-as-a-judge)

Overall Assessment: This is a high-quality applied research paper that proposes valuable technical innovations while addressing practical business problems. The REPO framework design is sound, experimental evaluation is thorough, and demonstrated emergent capabilities are impressive. While there is room for improvement in generalization verification and theoretical analysis, its contributions to complex task-oriented dialogue are significant.