PricingLogic: Evaluating LLMs Reasoning on Complex Tourism Pricing Tasks

Liu, Zhu, Al-Khalili et al.
We present PricingLogic, the first benchmark that probes whether Large Language Models (LLMs) can reliably automate tourism-related prices when multiple, overlapping fare rules apply. Travel agencies are eager to offload this error-prone task onto AI systems; however, deploying LLMs without verified reliability could result in significant financial losses and erode customer trust. PricingLogic comprises 300 natural-language questions based on booking requests derived from 42 real-world pricing policies, spanning two levels of difficulty: (i) basic customer-type pricing and (ii) bundled-tour calculations involving interacting discounts. Evaluations of a line of LLMs reveal a steep performance drop on the harder tier, exposing systematic failures in rule interpretation and arithmetic reasoning. These results highlight that, despite their general capabilities, today's LLMs remain unreliable in revenue-critical applications without further safeguards or domain adaptation. Our code and dataset are available at https://github.com/EIT-NLP/PricingLogic.

Basic Information

  • Paper ID: 2510.12409
  • Title: PricingLogic: Evaluating LLMs Reasoning on Complex Tourism Pricing Tasks
  • Authors: Yunuo Liu, Dawei Zhu, Zena Al-Khalili, Dai Cheng, Yanjun Chen, Dietrich Klakow, Wei Zhang, Xiaoyu Shen
  • Category: cs.AI
  • Publication Date: October 14, 2025
  • Paper Link: https://arxiv.org/abs/2510.12409

Abstract

This paper introduces PricingLogic, the first benchmark for evaluating the reasoning capabilities of large language models (LLMs) on complex tourism pricing tasks. The benchmark comprises 300 natural-language questions based on 42 real-world pricing policies, spanning two difficulty levels: (1) basic customer-type pricing and (2) package tour calculations involving interacting discounts. Evaluation of multiple LLMs reveals a sharp performance drop on the harder level, exposing systematic failures in rule interpretation and arithmetic reasoning.

Research Background and Motivation

Problem Definition

Travel agencies are eager to delegate error-prone pricing calculations to AI systems, yet deploying LLMs without verifying their reliability could result in significant financial losses and damage customer trust. Existing benchmarks do not adequately capture the domain-specific knowledge, the navigation of conflicting rules, and the high reliability that real-world applications require.

Research Significance

  1. High Practical Value: Tourism pricing involves multiple destinations, diverse fare types, and dynamic pricing policies, making manual processing both time-consuming and error-prone
  2. Substantial Technical Challenges: Requires reasoning under complex constraints, presenting non-trivial challenges for LLMs
  3. Urgent Business Needs: Travel agencies seek to deploy LLM-based systems to handle naturally expressed questions

Limitations of Existing Approaches

Existing benchmarks fall short in evaluating LLMs' performance on real-world applications, particularly in scenarios requiring domain expertise, handling conflicting rules, and ensuring high reliability.

Core Contributions

  1. First Tourism Pricing Benchmark: Proposes PricingLogic, containing 300 questions and 42 real-world pricing policy documents
  2. Comprehensive Performance Evaluation: Conducts thorough evaluation of multiple open-source and commercial LLMs, demonstrating that this task poses significant challenges for current LLMs
  3. Code-Assisted Reasoning Method: Demonstrates substantial improvements of code-assisted reasoning (CaR) on complex reasoning and computational tasks
  4. Systematic Failure Analysis: Reveals systematic issues in LLMs' rule interpretation and arithmetic reasoning

Methodology Details

Task Definition

  • Input: Natural language tourism booking requests and corresponding pricing policy documents
  • Output: Accurate total price calculation
  • Constraints: Must handle multiple, overlapping fare rules and select the most favorable pricing option for customers
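
The constraint of picking the most favorable option among overlapping fare rules can be sketched as follows. This is only an illustration: the fares, the group rate, the group-size threshold, and the function name quote are all invented here and are not taken from the benchmark's policy documents.

```python
from decimal import Decimal

# Hypothetical policy for one activity (all numbers invented for illustration).
FARES = {"regular": Decimal("100"), "senior": Decimal("60"), "student": Decimal("50")}
GROUP_RATE = Decimal("70")   # assumed flat per-person rate for groups
GROUP_MIN_SIZE = 5           # assumed threshold for the group rate

def quote(headcount: dict[str, int]) -> Decimal:
    """Return the cheapest total among the fare rules that apply."""
    people = sum(headcount.values())
    # Option A: charge each person the fare for their customer type.
    options = [sum((FARES[t] * n for t, n in headcount.items()), Decimal("0"))]
    # Option B: the group rate, only if the party is large enough.
    if people >= GROUP_MIN_SIZE:
        options.append(GROUP_RATE * people)
    return min(options)

print(quote({"regular": 2, "senior": 1, "student": 3}))  # 410 (per-type fares beat 420)
print(quote({"regular": 6}))                              # 420 (group rate beats 600)
```

Both rules apply to each of the two orders, yet the cheaper rule differs between them, which is why every applicable option has to be priced before answering.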

Dataset Construction

Data Collection

  • Geographic Coverage: 7 attractions, 33 different activities
  • Customer Types: 9 customer categories (regular tourists, contract groups, seniors, students, etc.)
  • Policy Complexity: Includes specific pricing structures, discount thresholds, and special conditions

Task Setup

Task 1: Standard Pricing Policies

  • Uses 33 pricing documents
  • 150 test samples
  • No package bundling

Task 2: Package Pricing Policies

  • Introduces package tour discounts based on Task 1
  • Increases problem complexity
  • May involve multiple viable pricing options
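
To make the extra difficulty of Task 2 concrete, the sketch below prices an order both with and without a package and keeps the cheaper result. The activity names, prices, and the single bundle are hypothetical, chosen only to show how a package turns one calculation into a comparison.

```python
from decimal import Decimal

# Hypothetical single-activity prices and one bundle (invented numbers).
SINGLE = {"boat": Decimal("80"), "cable_car": Decimal("120"), "museum": Decimal("40")}
BUNDLE_ITEMS = frozenset({"boat", "cable_car"})
BUNDLE_PRICE = Decimal("170")

def price_per_person(activities: set[str]) -> Decimal:
    """Cheapest per-person price, with or without the bundle."""
    separate = sum((SINGLE[a] for a in activities), Decimal("0"))
    if not BUNDLE_ITEMS <= activities:
        return separate  # the bundle does not apply to this order
    # Bundle price plus single prices for whatever the bundle leaves out.
    leftover = activities - BUNDLE_ITEMS
    bundled = BUNDLE_PRICE + sum((SINGLE[a] for a in leftover), Decimal("0"))
    return min(separate, bundled)

print(price_per_person({"boat", "cable_car", "museum"}))  # 210 (170 + 40, not 240)
print(price_per_person({"boat", "museum"}))               # 120 (bundle not applicable)
```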

Model Architecture

End-to-End (E2E) Prompting Method

  • Single-pass inference for pricing
  • Standardized pricing policy document structure and terminology
  • Guides LLMs through two stages: item identification and price calculation

Code-Assisted Reasoning (CaR) Method

  • Stage 1: Generate dedicated calculator functions for each pricing policy document
  • Stage 2: Parse natural language orders, extract relevant information, and convert to code input parameters
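
A minimal sketch of how the two stages fit together is given below. It does not reproduce the paper's prompts or generated code: the function calculate_price, its parameters, the fares, and the JSON string standing in for a model response are all assumptions made for illustration.

```python
import json
from decimal import Decimal

# Stage 1 stand-in: in CaR an LLM would generate this function from a policy
# document. Here it is hand-written, with invented fares, to show the shape.
def calculate_price(customer_type: str, num_people: int) -> Decimal:
    fares = {"regular": Decimal("100"), "student": Decimal("50")}
    return fares[customer_type] * num_people

# Stage 2 stand-in: in CaR an LLM would extract these parameters from the
# natural-language order. The JSON string below is a hypothetical model output.
llm_output = '{"customer_type": "student", "num_people": 4}'

# The extracted parameters are passed straight to the calculator, so the
# arithmetic is done by the interpreter rather than by the model.
params = json.loads(llm_output)
print(calculate_price(**params))  # 200
```

Splitting the work this way means a wrong answer can come from either the calculator code or the extracted parameters.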

Technical Innovations

  1. Two-Stage Separation Design: Separates policy interpretation from parameter extraction, improving handling of complex pricing logic
  2. Real-World Constraint Modeling: Addresses practical constraints such as diverse customer populations and overlapping discount rules
  3. Oracle Control Experiments: Isolates code generation errors from parameter extraction errors through CaR-Oracle methodology

Experimental Setup

Dataset

  • Total Questions: 300 natural language questions
  • Difficulty Distribution: Easy (60), Medium (50), Hard (40) problems per task
  • Policy Documents: 42 real-world pricing policy documents

Evaluation Metrics

Uses exact match comparison between model predictions and correct answers, reporting accuracy rates
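
One plausible way to compute such an exact-match accuracy is sketched below. The numeric normalization (treating "120.00" and "120" as equal, and counting unparseable predictions as wrong) is an assumption of this sketch; the summary does not say how the authors normalize answers.

```python
from decimal import Decimal, InvalidOperation

def exact_match_accuracy(predictions: list[str], answers: list[str]) -> float:
    """Fraction of predictions numerically equal to the reference answer."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    correct = 0
    for pred, gold in zip(predictions, answers):
        try:
            correct += Decimal(pred.strip()) == Decimal(gold.strip())
        except InvalidOperation:
            pass  # an unparseable prediction counts as wrong (assumed rule)
    return correct / len(answers) if answers else 0.0

# Hypothetical predictions and reference totals.
print(exact_match_accuracy(["410", "209.5", "n/a", "120.00"],
                           ["410", "210", "55", "120"]))  # 0.5
```

Under exact match a total that is off by a small amount scores the same as one that is far off, which matches the zero-tolerance nature of pricing.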

Baseline Methods

Evaluates multiple state-of-the-art LLMs:

  • Commercial Models: GPT-4o, DeepSeek-V3/R1, Claude Sonnet 4
  • Open-Source Models: Qwen2.5-7B/32B/Max

Implementation Details

  • Temperature set to 0.0 for deterministic outputs
  • Introduces CaR-Oracle control conditions to isolate error sources
  • Compares 0-shot and 3-shot performance

Experimental Results

Main Results

Task 1 Results

Easy Problems:

  • E2E Method: All models except Qwen2.5-7B achieve accuracy above 76%
  • CaR Method: Most models achieve accuracy above 90%
  • Best Performance: Claude Sonnet 4 reaches 96.67% (CaR)

Hard Problems:

  • E2E Method: All models barely exceed 50% accuracy
  • CaR Method: Still below 60%, with substantial room for improvement

Task 2 Results

Significant Performance Degradation:

  • Even the strongest model, Claude Sonnet 4, achieves only 35.0% E2E accuracy on hard problems
  • CaR method provides significant improvements, particularly on medium-difficulty problems

Ablation Studies

CaR-Oracle Analysis

  • Easy Tasks: Three LLMs achieve 100% accuracy using oracle (human-written) code
  • Medium Tasks: Generated code has major defects, but strong LLMs still correctly map parameters
  • Hard Tasks: Even with human-written code, models struggle to provide correct parameters
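
The logic of the oracle control can be sketched as a simple attribution rule over paired outcomes. This is a simplified reading, not the paper's procedure: the labels, the function attribute_error, and the sample outcomes are hypothetical, and a case where both runs fail may involve code defects as well as parameter errors.

```python
from collections import Counter

def attribute_error(car_correct: bool, oracle_correct: bool) -> str:
    """Classify one question from its CaR and CaR-Oracle outcomes."""
    if car_correct:
        return "car_correct"
    if oracle_correct:
        # Human-written code fixes the answer, so the generated code is suspect.
        return "code_generation"
    # Still wrong with human-written code, so parameter extraction is suspect.
    return "parameter_extraction"

# Hypothetical per-question outcomes: (CaR correct?, CaR-Oracle correct?)
outcomes = [(True, True), (False, True), (False, False), (False, True)]
tally = Counter(attribute_error(c, o) for c, o in outcomes)
print(tally["code_generation"], tally["parameter_extraction"])  # 2 1
```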

3-shot vs 0-shot Comparison

  • 3-shot prompting yields only marginal improvements
  • No improvement in complex scenarios
  • Suggests performance limitations reflect fundamental reasoning challenges rather than insufficient demonstrations

Case Analysis

Error Pattern Analysis

  1. Customer Category Misidentification: Models frequently misidentify customer types
  2. Pricing Condition Omission: Overlook important pricing conditions
  3. Package Logic Errors: Struggle to identify when package discounts should apply
  4. Optimal Combination Calculation Failure: Unable to compute optimal combinations among multiple valid package options

Code Quality Differences

  • LLM-Generated Code: Simplified linear if-elif structures
  • Human-Written Code: Complex multi-option evaluation systems with systematic comparison and optimization
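
The contrast between the two coding styles can be illustrated with a toy catalogue. Everything here is hypothetical (activity names, prices, packages, and the assumption that overlapping packages cannot be combined); neither function is taken from the paper's generated or human-written code.

```python
from decimal import Decimal
from itertools import combinations

# Hypothetical catalogue (invented numbers): single prices and two packages.
SINGLE = {"A": Decimal("100"), "B": Decimal("100"), "C": Decimal("100")}
PACKAGES = {
    "AB": (frozenset("AB"), Decimal("180")),
    "BC": (frozenset("BC"), Decimal("150")),
}

def singles(items) -> Decimal:
    """Total price of the given activities bought one by one."""
    return sum((SINGLE[i] for i in items), Decimal("0"))

def linear_price(wanted: frozenset) -> Decimal:
    """if-elif style: apply the first package that fits, then stop looking."""
    if PACKAGES["AB"][0] <= wanted:
        items, price = PACKAGES["AB"]
    elif PACKAGES["BC"][0] <= wanted:
        items, price = PACKAGES["BC"]
    else:
        return singles(wanted)
    return price + singles(wanted - items)

def exhaustive_price(wanted: frozenset) -> Decimal:
    """Compare every set of non-overlapping applicable packages."""
    usable = [p for p in PACKAGES.values() if p[0] <= wanted]
    best = singles(wanted)  # baseline: no package at all
    for k in range(1, len(usable) + 1):
        for combo in combinations(usable, k):
            covered = [i for items, _ in combo for i in items]
            if len(covered) != len(set(covered)):
                continue  # packages overlap; assumed not combinable
            total = sum((price for _, price in combo), Decimal("0"))
            best = min(best, total + singles(wanted - set(covered)))
    return best

order = frozenset("ABC")
print(linear_price(order))      # 280: takes AB (180) + C (100)
print(exhaustive_price(order))  # 250: finds BC (150) + A (100)
```

The linear version is not wrong about any single rule; it simply commits to the first match and never sees the cheaper alternative.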

Related Work

LLMs in Real-World Applications

  • Recent research focuses on evaluating LLMs in practical applications
  • RuleArena tests rule-following capabilities but lacks handling of rule conflicts
  • This work extends the paradigm to real tourism pricing domains

Code-Assisted Reasoning

  • Improves LLM reasoning on computation-intensive tasks through code
  • Prior work primarily targets controlled mathematical problems
  • This work extends the paradigm to real-world applications whose complexity goes beyond textbook problems

Conclusions and Discussion

Main Conclusions

  1. Performance Limitations: Even advanced LLMs perform poorly on complex pricing scenarios
  2. CaR Method Effectiveness: Code-assisted reasoning generally outperforms end-to-end approaches
  3. Systematic Challenges: Tasks involving multiple overlapping rules expose fundamental LLM limitations

Limitations

  1. Limited Method Scope: Focuses only on E2E and CaR methods, leaving other approaches such as fine-tuning unexplored
  2. Dynamic Environment Challenges: Fine-tuning approaches are impractical in dynamic business environments
  3. Evaluation Scope: Primarily concentrated on tourism pricing domain

Future Directions

  1. Domain Adaptation Techniques: Develop specialized safeguards for revenue-critical applications
  2. Hybrid Reasoning Systems: Combine symbolic and neural reasoning approaches
  3. Real-Time Verification Mechanisms: Develop real-time error detection and correction mechanisms

In-Depth Evaluation

Strengths

  1. Significant Practical Value: Addresses genuine business needs with direct application potential
  2. Rigorous Benchmark Design: Constructed from real data with clear difficulty stratification
  3. Methodological Innovation: CaR method design is elegant, effectively isolating different error types
  4. Comprehensive Analysis: Deep failure pattern analysis through controlled experiments like Oracle methodology

Weaknesses

  1. Domain Limitation: Primarily focused on tourism pricing; generalization capability remains to be verified
  2. Limited Model Coverage: Does not include more diverse model architectures and training strategies
  3. Insufficient Solutions: Primarily identifies problems with relatively limited proposed solutions

Impact

  1. Academic Contribution: Provides important evidence of LLM limitations in complex reasoning tasks
  2. Practical Value: Offers important reference for AI applications in the tourism industry
  3. Methodological Contribution: CaR method is generalizable to other domains requiring complex computation

Applicable Scenarios

  1. Rule-Intensive Applications: Suitable for scenarios requiring handling of complex, overlapping rules
  2. Computation-Intensive Tasks: Application domains requiring precise numerical calculations
  3. Business-Critical Systems: Revenue-critical applications with extremely high accuracy requirements

References

The paper cites important works across related fields, including:

  • Research on code generation and mathematical problem solving
  • Evaluation work on LLMs in real-world applications
  • Methods related to program-aided language models

Summary: This paper systematically reveals limitations of current LLMs in handling complex, real-world reasoning tasks by constructing PricingLogic, the first tourism pricing benchmark. While code-assisted reasoning methods bring significant improvements, substantial gaps remain on the most challenging tasks, emphasizing the importance of rigorous evaluation before deploying AI systems in revenue-critical applications.