
PricingLogic: Evaluating LLMs Reasoning on Complex Tourism Pricing Tasks

Liu, Zhu, Al-Khalili et al.

We present PricingLogic, the first benchmark that probes whether Large Language Models (LLMs) can reliably automate tourism-related prices when multiple, overlapping fare rules apply. Travel agencies are eager to offload this error-prone task onto AI systems; however, deploying LLMs without verified reliability could result in significant financial losses and erode customer trust. PricingLogic comprises 300 natural-language questions based on booking requests derived from 42 real-world pricing policies, spanning two levels of difficulty: (i) basic customer-type pricing and (ii) bundled-tour calculations involving interacting discounts. Evaluations of a line of LLMs reveal a steep performance drop on the harder tier, exposing systematic failures in rule interpretation and arithmetic reasoning. These results highlight that, despite their general capabilities, today's LLMs remain unreliable in revenue-critical applications without further safeguards or domain adaptation. Our code and dataset are available at https://github.com/EIT-NLP/PricingLogic.



Basic Information

Paper ID: 2510.12409
Title: PricingLogic: Evaluating LLMs Reasoning on Complex Tourism Pricing Tasks
Authors: Yunuo Liu, Dawei Zhu, Zena Al-Khalili, Dai Cheng, Yanjun Chen, Dietrich Klakow, Wei Zhang, Xiaoyu Shen
Category: cs.AI
Publication Date: October 14, 2025
Paper Link: https://arxiv.org/abs/2510.12409

Abstract

This paper introduces PricingLogic, the first benchmark for evaluating the reasoning capabilities of large language models (LLMs) on complex tourism pricing tasks. The benchmark comprises 300 natural-language questions based on 42 real-world pricing policies, spanning two difficulty levels: (1) basic customer-type pricing and (2) package-tour calculations involving interacting discounts. Evaluation of multiple LLMs reveals sharp performance degradation on the harder tier, exposing systematic failures in rule interpretation and arithmetic reasoning.

Research Background and Motivation

Problem Definition

Travel agencies are eager to delegate error-prone price calculation to AI systems, yet deploying LLMs without verified reliability could cause significant financial losses and erode customer trust. Existing benchmarks do not adequately capture the domain-specific knowledge, handling of conflicting rules, and high reliability that real-world applications require.

Research Significance

High Practical Value: Tourism pricing involves multiple destinations, diverse fare types, and dynamic pricing policies, making manual processing both time-consuming and error-prone
Substantial Technical Challenges: Requires reasoning under complex constraints, presenting non-trivial challenges for LLMs
Urgent Business Needs: Travel agencies seek to deploy LLM-based systems to handle naturally expressed questions

Limitations of Existing Approaches

Existing benchmarks fall short in evaluating LLMs' performance on real-world applications, particularly in scenarios requiring domain expertise, handling conflicting rules, and ensuring high reliability.

Core Contributions

First Tourism Pricing Benchmark: Proposes PricingLogic, containing 300 questions and 42 real-world pricing policy documents
Comprehensive Performance Evaluation: Conducts thorough evaluation of multiple open-source and commercial LLMs, demonstrating that this task poses significant challenges for current LLMs
Code-Assisted Reasoning Method: Shows that code-assisted reasoning (CaR) substantially improves performance on complex reasoning and computational tasks
Systematic Failure Analysis: Reveals systematic issues in LLMs' rule interpretation and arithmetic reasoning

Methodology Details

Task Definition

Input: Natural-language tourism booking requests and the corresponding pricing policy documents
Output: Accurate total price calculation
Constraints: Must handle multiple, overlapping fare rules and select the most favorable pricing option for customers
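
To make the "most favorable option" constraint concrete, here is a minimal sketch. It is not the benchmark's code: the fare table, customer types, group threshold, and function names are hypothetical values invented for illustration. Only the idea of collecting every applicable rule and keeping the cheapest one comes from the task description.

```python
from dataclasses import dataclass

# Hypothetical per-person fares for one activity (illustrative values only).
PRICE_BY_TYPE = {"regular": 100, "student": 80, "senior": 50}
GROUP_THRESHOLD = 10  # assumed minimum party size for the group fare
GROUP_PRICE = 70      # assumed per-person group fare


@dataclass
class Booking:
    customer_type: str
    party_size: int


def applicable_unit_prices(booking: Booking) -> dict:
    """Collect every per-person fare this booking qualifies for."""
    prices = {"regular": PRICE_BY_TYPE["regular"]}
    if booking.customer_type in PRICE_BY_TYPE:
        prices[booking.customer_type] = PRICE_BY_TYPE[booking.customer_type]
    if booking.party_size >= GROUP_THRESHOLD:
        prices["group"] = GROUP_PRICE
    return prices


def total_price(booking: Booking) -> int:
    """Charge the whole party at the cheapest fare it qualifies for."""
    return min(applicable_unit_prices(booking).values()) * booking.party_size
```

Under these assumed fares, a party of 12 students qualifies for both the student fare and the group fare, and the group fare wins; a party of 12 seniors also qualifies for both, but the senior fare wins.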

Dataset Construction

Data Collection

Geographic Coverage: 7 attractions, 33 different activities
Customer Types: 9 customer categories (regular tourists, contract groups, seniors, students, etc.)
Policy Complexity: Includes specific pricing structures, discount thresholds, and special conditions

Task Setup

Task 1: Standard Pricing Policies

Uses 33 pricing documents
150 test samples
No package bundling

Task 2: Package Pricing Policies

Adds package-tour discounts on top of the Task 1 policies
Increases problem complexity
May involve multiple viable pricing options

Model Architecture

End-to-End (E2E) Prompting Method

Single-pass inference for pricing
Standardized pricing policy document structure and terminology
Guides LLMs through two stages: item identification and price calculation

Code-Assisted Reasoning (CaR) Method

Stage 1: Generate a dedicated calculator function for each pricing policy document
Stage 2: Parse natural-language orders, extract the relevant information, and convert it into input parameters for the code
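
The two stages can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: in CaR both the calculator function and the extracted parameters are produced by an LLM, whereas here both are hand-written stand-ins, and the policy, the function name calculate_price, and the parameter names are hypothetical.

```python
# Stage 1 (stand-in): a calculator function an LLM might generate from one
# policy document. The policy and its numbers are hypothetical.
def calculate_price(customer_type: str, num_people: int) -> int:
    """Hypothetical policy: adults pay 120, students pay 90; parties of 10 or
    more pay 100 each unless their own fare is already lower."""
    fares = {"adult": 120, "student": 90}
    if customer_type not in fares:
        raise ValueError(f"unknown customer type: {customer_type}")
    unit = fares[customer_type]
    if num_people >= 10:
        unit = min(unit, 100)
    return unit * num_people


# Stage 2 (stand-in): parameters an LLM might extract from the order
# "We are 12 adults and would like to book the tour."
extracted_params = {"customer_type": "adult", "num_people": 12}

# The final price comes from executing code, not from the model's arithmetic.
price = calculate_price(**extracted_params)
print(price)
```

The separation means a wrong answer can come from either side: a faulty function in Stage 1, or wrong arguments in Stage 2.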

Technical Innovations

Two-Stage Separation Design: Separates policy interpretation from parameter extraction, improving handling of complex pricing logic
Real-World Constraint Modeling: Addresses practical constraints such as diverse customer populations and overlapping discount rules
Oracle Control Experiments: Isolates code generation errors from parameter extraction errors through CaR-Oracle methodology

Experimental Setup

Dataset

Total Questions: 300 natural language questions
Difficulty Distribution: Easy (60), Medium (50), Hard (40) problems per task
Policy Documents: 42 real-world pricing policy documents

Evaluation Metrics

Uses exact match comparison between model predictions and correct answers, reporting accuracy rates
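
A minimal sketch of exact-match accuracy is shown below. It assumes predictions have already been parsed into numbers in the same unit as the reference answers, with None standing for an unparseable model output; the parsing step and the function name are assumptions, not details taken from the paper.

```python
def exact_match_accuracy(predictions, answers):
    """Fraction of predictions that equal the reference price exactly."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        return 0.0
    # A prediction of None (unparseable output) never matches a reference price.
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)
```

Because the comparison is exact, a prediction that is off by a single currency unit counts as wrong, which matches the revenue-critical framing of the task.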

Baseline Methods

Evaluates multiple state-of-the-art LLMs:

Commercial Models: GPT-4o, DeepSeek-V3/R1, Claude Sonnet 4
Open-Source Models: Qwen2.5-7B/32B/Max

Implementation Details

Temperature set to 0.0 for deterministic outputs
Introduces CaR-Oracle control conditions to isolate error sources
Compares 0-shot and 3-shot performance

Experimental Results

Main Results

Task 1 Results

Easy Problems:

E2E Method: All models except Qwen2.5-7B achieve accuracy above 76%
CaR Method: Most models achieve accuracy above 90%
Best Performance: Claude Sonnet 4 reaches 96.67% (CaR)

Hard Problems:

E2E Method: All models barely exceed 50% accuracy
CaR Method: Still below 60%, with substantial room for improvement

Task 2 Results

Significant Performance Degradation:

Even the strongest model, Claude Sonnet 4, achieves only 35.0% E2E accuracy on hard problems
CaR method provides significant improvements, particularly on medium-difficulty problems

Ablation Studies

CaR-Oracle Analysis

Easy Problems: Three LLMs achieve 100% accuracy when given oracle code
Medium Problems: LLM-generated code has major defects, yet strong LLMs still map parameters correctly when given oracle code
Hard Problems: Even with human-written code, models struggle to provide correct parameters
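
One way to read the oracle comparison is as a per-question attribution of errors. The sketch below is an illustrative interpretation, not the paper's analysis script: the function name, the labels, and the attribution rule are assumptions. It takes per-question correctness under CaR (generated code) and CaR-Oracle (human-written code) and counts where each failure most plausibly originates.

```python
from collections import Counter


def attribute_errors(car_correct, oracle_correct):
    """Label each question by comparing CaR and CaR-Oracle outcomes.

    Both arguments are lists of booleans, one entry per question.
    """
    labels = Counter()
    for car_ok, oracle_ok in zip(car_correct, oracle_correct):
        if car_ok and oracle_ok:
            labels["correct in both"] += 1
        elif oracle_ok:
            # Human-written code fixed it, so the generated code was at fault.
            labels["code generation error"] += 1
        elif not car_ok:
            # Still wrong with correct code, so the parameters are suspect.
            labels["parameter extraction error"] += 1
        else:
            # Unexpected case, worth inspecting by hand.
            labels["correct only with generated code"] += 1
    return labels
```

Under this reading, the hard-problem finding above corresponds to many questions that remain wrong even with oracle code.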

3-shot vs 0-shot Comparison

3-shot prompting yields only marginal improvements
No improvement in complex scenarios
Suggests performance limitations reflect fundamental reasoning challenges rather than insufficient demonstrations

Case Analysis

Error Pattern Analysis

Customer Category Misidentification: Models frequently misidentify customer types
Pricing Condition Omission: Overlook important pricing conditions
Package Logic Errors: Struggle to identify when package discounts should apply
Optimal Combination Calculation Failure: Unable to compute optimal combinations among multiple valid package options

Code Quality Differences

LLM-Generated Code: Simplified linear if-elif structures
Human-Written Code: Complex multi-option evaluation systems with systematic comparison and optimization
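
The difference between the two styles can be illustrated with a small sketch. All item names, prices, and packages below are hypothetical, and neither function is taken from the paper or its repository. The first function mimics a linear if-elif structure that applies the first matching package; the second evaluates every combination of non-overlapping packages and keeps the cheapest total.

```python
from itertools import combinations

# Hypothetical single-ticket prices and package deals (illustrative only).
ITEM_PRICE = {"cable_car": 100, "boat": 80, "museum": 70}
PACKAGES = {
    frozenset({"cable_car", "boat"}): 150,
    frozenset({"boat", "museum"}): 110,
}


def linear_price(order: set) -> int:
    """If-elif style: apply the first matching package, then stop looking."""
    if {"cable_car", "boat"} <= order:
        rest = order - {"cable_car", "boat"}
        return 150 + sum(ITEM_PRICE[i] for i in rest)
    elif {"boat", "museum"} <= order:
        rest = order - {"boat", "museum"}
        return 110 + sum(ITEM_PRICE[i] for i in rest)
    return sum(ITEM_PRICE[i] for i in order)


def best_price(order: set) -> int:
    """Multi-option style: price every set of non-overlapping packages."""
    usable = [p for p in PACKAGES if p <= order]
    best = sum(ITEM_PRICE[i] for i in order)  # option with no package
    for k in range(1, len(usable) + 1):
        for combo in combinations(usable, k):
            covered = frozenset().union(*combo)
            if sum(len(p) for p in combo) != len(covered):
                continue  # packages share an item, so they cannot be combined
            rest = order - covered
            total = sum(PACKAGES[p] for p in combo) + sum(ITEM_PRICE[i] for i in rest)
            best = min(best, total)
    return best
```

For an order containing all three items, the if-elif version stops at the first package it matches, while the exhaustive version also prices the second package and finds a lower total. The exhaustive search is exponential in the number of usable packages, which is acceptable only because a single policy lists few packages.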

Related Work

LLMs in Real-World Applications

Recent research focuses on evaluating LLMs in practical applications
RuleArena tests rule-following capabilities but lacks handling of rule conflicts
This work extends the paradigm to real tourism pricing domains

Code-Assisted Reasoning

Improves LLM reasoning on computation-intensive tasks through code
Prior work primarily targets controlled mathematical problems
This work extends the paradigm to real-world applications whose complexity goes beyond textbook problems

Conclusions and Discussion

Main Conclusions

Performance Limitations: Even advanced LLMs perform poorly on complex pricing scenarios
CaR Method Effectiveness: Code-assisted reasoning generally outperforms end-to-end approaches
Systematic Challenges: Tasks involving multiple overlapping rules expose fundamental LLM limitations

Limitations

Limited Method Scope: Focuses only on E2E and CaR methods, leaving other approaches such as fine-tuning unexplored
Dynamic Environment Challenges: Fine-tuning approaches are impractical in dynamic business environments
Evaluation Scope: Primarily concentrated on tourism pricing domain

Future Directions

Domain Adaptation Techniques: Develop specialized safeguards for revenue-critical applications
Hybrid Reasoning Systems: Combine symbolic and neural reasoning approaches
Real-Time Verification Mechanisms: Develop real-time error detection and correction mechanisms

In-Depth Evaluation

Strengths

Significant Practical Value: Addresses genuine business needs with direct application potential
Rigorous Benchmark Design: Constructed from real data with clear difficulty stratification
Methodological Innovation: The two-stage CaR design is clean, and its CaR-Oracle variant effectively isolates different error types
Comprehensive Analysis: In-depth failure-pattern analysis through controlled experiments such as the CaR-Oracle setup

Weaknesses

Domain Limitation: Primarily focused on tourism pricing; generalization capability remains to be verified
Limited Model Coverage: Does not include more diverse model architectures and training strategies
Insufficient Solutions: Primarily identifies problems with relatively limited proposed solutions

Impact

Academic Contribution: Provides important evidence of LLM limitations in complex reasoning tasks
Practical Value: Offers important reference for AI applications in the tourism industry
Methodological Contribution: CaR method is generalizable to other domains requiring complex computation

Applicable Scenarios

Rule-Intensive Applications: Suitable for scenarios requiring handling of complex, overlapping rules
Computation-Intensive Tasks: Application domains requiring precise numerical calculations
Business-Critical Systems: Revenue-critical applications with extremely high accuracy requirements

Summary: This paper systematically reveals limitations of current LLMs in handling complex, real-world reasoning tasks by constructing PricingLogic, the first tourism pricing benchmark. While code-assisted reasoning methods bring significant improvements, substantial gaps remain on the most challenging tasks, emphasizing the importance of rigorous evaluation before deploying AI systems in revenue-critical applications.