LLMs as Planning Formalizers: A Survey for Leveraging Large Language Models to Construct Automated Planning Models

Tantakoun, Zhu, Muise
Large Language Models (LLMs) excel in various natural language tasks but often struggle with long-horizon planning problems requiring structured reasoning. This limitation has drawn interest in integrating neuro-symbolic approaches within the Automated Planning (AP) and Natural Language Processing (NLP) communities. However, identifying optimal AP deployment frameworks can be daunting and introduces new challenges. This paper aims to provide a timely survey of the current research with an in-depth analysis, positioning LLMs as tools for formalizing and refining planning specifications to support reliable off-the-shelf AP planners. By systematically reviewing the current state of research, we highlight methodologies, and identify critical challenges and future directions, hoping to contribute to the joint research on NLP and Automated Planning.

Basic Information

  • Paper ID: 2503.18971
  • Title: LLMs as Planning Formalizers: A Survey for Leveraging Large Language Models to Construct Automated Planning Models
  • Authors: Marcus Tantakoun, Christian Muise, Xiaodan Zhu (Queen's University)
  • Classification: cs.AI
  • Publication Date: March 2025 (arXiv v2: October 25, 2025)
  • Paper Link: https://arxiv.org/abs/2503.18971v2

Abstract

Large Language Models (LLMs) demonstrate exceptional performance on various natural language tasks but continue to struggle with long-horizon planning problems requiring structured reasoning. This paper provides a timely survey that systematically analyzes the current state of research positioning LLMs as tools for formalizing and refining planning specifications to support reliable off-the-shelf automated planning (AP) systems. Through systematic review of approximately 80 related works, the paper highlights methodologies, identifies key challenges and future directions, and provides an open-source Python library Language-to-Plan (L2P) to facilitate research in this field.

Research Background and Motivation

1. Core Problem

Despite LLMs' superior performance on natural language processing tasks, they perform poorly on long-horizon planning and reasoning tasks, frequently producing unreliable plans. Using LLMs directly as planners (LLM-as-Planner) cannot guarantee the correctness, optimality, and reliability of outputs.

2. Problem Significance

  • Nature of Planning: Planning is a crucial component of System II cognition, requiring structured reasoning, whereas LLMs excel at System I tasks
  • Practical Application Bottleneck: Extracting planning models has long been a major obstacle to widespread application of planning technology
  • Reliability Requirements: Practical applications require verifiable, interpretable, and robust planning solutions

3. Limitations of Existing Approaches

  • Direct Planning Methods: When LLMs directly generate action sequences, performance degrades with iterative feedback
  • Lack of Structured Guarantees: LLMs cannot provide correctness guarantees like classical planning systems
  • Long-term Dependency Issues: As scale increases, LLMs frequently fail to consider action effects and preconditions

4. Research Motivation

This paper proposes the LLMs-as-Formalizers paradigm: leveraging LLMs' strengths (extracting, interpreting, and refining planning model specifications from natural language) combined with classical automated planning systems' strengths (structured representations, logic, and search methods) to construct complementary neuro-symbolic frameworks.

Core Contributions

  1. Systematic Taxonomy: Proposes the first comprehensive classification system for LLM-driven automated planning model construction, including:
    • Model Generation: task modeling, domain modeling, hybrid modeling
    • Model Editing: code refinement and error correction
    • Model Benchmarks: evaluation frameworks and datasets
  2. Technical Methods Summary: Systematically reviews shared and innovative technical methods for integrating LLMs into AI planning frameworks and their limitations
  3. Research Question Framework: Proposes two core research questions (RQ):
    • RQ1: How can LLMs accurately align with human objectives to ensure planning model specifications correctly represent desired expectations and goals?
    • RQ2: To what extent and granularity can natural language instructions be effectively converted into accurate planning model definitions?
  4. Open-Source Toolkit: Provides Language-to-Plan (L2P), an open-source Python library that implements landmark methods covered in the survey, supporting:
    • Comprehensive PDDL extraction and refinement tool suite
    • Modular design supporting flexible prompting styles and custom pipelines
    • Fully autonomous end-to-end pipeline capabilities
  5. Future Direction Guidance: Identifies key challenges and outlines future research directions for the field

Methodology Details

Task Definition

This survey focuses on the LLMs-as-Formalizers paradigm, using LLMs to construct automated planning (AP) model specifications (primarily in PDDL format), which are then solved by domain-independent planners. This contrasts with:

  • LLMs-as-Planners: LLMs directly generate action sequences
  • LLMs-as-Heuristics: LLMs enhance search efficiency through heuristic guidance
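
The division of labour described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the survey's or L2P's implementation: `stub_formalizer` is a hypothetical stand-in for an LLM call, and the breadth-first `bfs_plan` stands in for an off-the-shelf domain-independent planner working on already-grounded actions.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Optional, Tuple


@dataclass(frozen=True)
class GroundAction:
    name: str
    pre: FrozenSet[str]      # facts that must hold before the action
    add: FrozenSet[str]      # facts made true by the action
    delete: FrozenSet[str]   # facts made false by the action


@dataclass(frozen=True)
class PlanningModel:
    init: FrozenSet[str]
    goal: FrozenSet[str]
    actions: Tuple[GroundAction, ...]


def bfs_plan(model: PlanningModel) -> Optional[List[str]]:
    """Breadth-first search over states; returns a shortest plan or None."""
    frontier = deque([(model.init, [])])
    seen = {model.init}
    while frontier:
        state, plan = frontier.popleft()
        if model.goal <= state:
            return plan
        for action in model.actions:
            if action.pre <= state:
                nxt = (state - action.delete) | action.add
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, plan + [action.name]))
    return None


def stub_formalizer(description: str) -> PlanningModel:
    """Hypothetical placeholder for an LLM call: returns a hand-written model."""
    drive = GroundAction(
        name="drive t1 depot shop",
        pre=frozenset({"truck-at t1 depot"}),
        add=frozenset({"truck-at t1 shop"}),
        delete=frozenset({"truck-at t1 depot"}),
    )
    return PlanningModel(
        init=frozenset({"truck-at t1 depot"}),
        goal=frozenset({"truck-at t1 shop"}),
        actions=(drive,),
    )


def formalize_then_plan(
    description: str,
    formalize: Callable[[str], PlanningModel] = stub_formalizer,
    solve: Callable[[PlanningModel], Optional[List[str]]] = bfs_plan,
) -> Optional[List[str]]:
    # The LLM only builds the model; the symbolic planner does the search.
    return solve(formalize(description))


plan = formalize_then_plan("Drive the truck from the depot to the shop.")
```

Because the planner only ever sees the symbolic model, any plan it returns is correct with respect to that model; what remains open is whether the model matches the user's intent.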

Core Framework Classification

1. Model Generation

Extracting and formalizing planning specifications from natural language input, divided into three subcategories:

1.1 Task Modeling

  • Goal Specification Methods:
    • Few-shot prompting (Collins et al., 2022; Grover & Mohan, 2024)
    • Chain-of-Thought (CoT) prompting (Lyu et al., 2023)
    • Handling varying degrees of ambiguity (Xie et al., 2023)
  • Complete Task Specification:
    • Open-loop Systems: LLM+P uses contextual examples to generate complete PDDL problem files
    • Closed-loop Systems: Auto-GPT+P generates initial states based on visual perception with automatic error correction loops
    • Multi-agent Collaboration: DaTAPlan, PlanCollabNL, TwoStep, LaMMA-P
  • Alternative Representations:
    • Geometric representations for task and motion planning
    • Temporal logic (TSL, STL, LTL)
    • Python function definitions for search space
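
For the complete task specification setting above, the target artifact is a PDDL problem file made of objects, an initial state, and a goal. The sketch below assumes the LLM's output has already been parsed into Python structures and only renders them; the function name `render_problem` and the Blocksworld-style facts are illustrative assumptions, not part of LLM+P or L2P.

```python
from typing import Dict, List


def render_problem(
    name: str,
    domain: str,
    objects: Dict[str, List[str]],  # type name -> object names
    init: List[str],                # ground facts without outer parentheses
    goal: List[str],
) -> str:
    """Render a typed PDDL problem file from structured components."""
    object_lines = [
        f"    {' '.join(names)} - {type_name}" for type_name, names in objects.items()
    ]
    init_lines = [f"    ({fact})" for fact in init]
    goal_lines = [f"      ({fact})" for fact in goal]
    return "\n".join(
        [
            f"(define (problem {name})",
            f"  (:domain {domain})",
            "  (:objects",
            *object_lines,
            "  )",
            "  (:init",
            *init_lines,
            "  )",
            "  (:goal",
            "    (and",
            *goal_lines,
            "    )",
            "  )",
            ")",
        ]
    )


problem_text = render_problem(
    name="stack-two",
    domain="blocksworld",
    objects={"block": ["a", "b"]},
    init=["on-table a", "on-table b", "clear a", "clear b", "arm-empty"],
    goal=["on a b"],
)
```

Keeping the rendering step deterministic means the LLM is only responsible for the content of the three lists, which is where the ambiguity handling discussed above matters.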

1.2 Domain Modeling

  • Single-Query Methods:
    • CLLaMP: Extracting PDDL action models from CVE descriptions
    • PROC2PDDL: Zone of Proximal Development prompt design
    • Candidate filtering methods (Huang et al., 2024b; Athalye et al., 2024)
  • Iterative Generation Methods:
    • LLM+DM: Adopts "generate-test-critique" approach, incrementally constructing domain components through multiple LLM calls
    • LLM+AL: Generating BC+ action language
    • LAMP: Algorithm family for learning abstract PDDL domain models
  • Closed-Loop Frameworks:
    • ADA: Generates candidate symbolic task decompositions, then iteratively prompts for definitions of the actions left undefined
    • COWP: Handles unexpected situations in open-world planning
    • LASP: Identifies potential errors from environment observations
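
The iterative and closed-loop methods above share a generate, check, and re-prompt structure. The following is a simplified sketch of that loop, not LLM+DM's actual procedure: `stub_llm` is a hypothetical stand-in for a model call, and the only "test" performed is a check for predicates that are used but not declared.

```python
import re
from typing import Callable, List, Set, Tuple

PDDL_KEYWORDS = {"and", "or", "not"}


def undeclared_predicates(declared: Set[str], expression: str) -> List[str]:
    """Return predicate names used in a PDDL expression but never declared."""
    used = set(re.findall(r"\(([a-z][\w-]*)", expression))
    return sorted(used - declared - PDDL_KEYWORDS)


def refine(
    generate: Callable[[str], str],
    declared: Set[str],
    max_rounds: int = 3,
) -> Tuple[str, int]:
    """Re-prompt with checker feedback until the candidate passes or the budget runs out."""
    feedback = ""
    for round_no in range(1, max_rounds + 1):
        candidate = generate(feedback)
        missing = undeclared_predicates(declared, candidate)
        if not missing:
            return candidate, round_no
        feedback = "Undeclared predicates: " + ", ".join(missing)
    raise ValueError(f"no valid candidate after {max_rounds} rounds: {feedback}")


def stub_llm(feedback: str) -> str:
    """Hypothetical model: drops the offending predicate once it is pointed out."""
    if "truck-has-space" in feedback:
        return "(and (truck-at ?t ?l) (package-at ?p ?l))"
    return "(and (truck-at ?t ?l) (package-at ?p ?l) (truck-has-space ?t))"


declared = {"truck-at", "package-at", "truck-holding"}
precondition, rounds = refine(stub_llm, declared)
```

In a real pipeline the checker would typically be an external validator or planner, and the alternative repair (declaring the missing predicate instead of dropping it) may be the right one; which repair is correct depends on user intent.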

1.3 Hybrid Modeling

Complete model generation combining PDDL domain and problem systems:

  • Foundational Methods: Kelly et al. (2023) extract narrative planning models from input stories, iteratively correcting them using planner error messages
  • Intermediate Representation Methods:
    • NL2Plan: First domain-independent offline end-to-end NL planning system
    • JSON token generation, consistency checking, and error correction loops
    • Reachability and dependency analysis
  • Practical Applications:
    • MORPHeus: Human-machine collaborative long-horizon planning with anomaly detection
    • InterPret: Learning PDDL predicates through interactive user language feedback
    • AgentGen: Synthesizing diverse PDDL tasks using LLMs for training
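
The reachability analysis mentioned under intermediate representation methods can take several forms; the source does not specify one. A common variant, assumed here purely for illustration, ignores delete effects and computes a fixpoint of reachable facts, flagging any goal fact that can never be produced. All names and facts below are hypothetical.

```python
from typing import FrozenSet, Iterable, List, Set, Tuple

# Each action is (preconditions, add effects); delete effects are ignored on purpose.
Action = Tuple[FrozenSet[str], FrozenSet[str]]


def relaxed_reachable(init: Iterable[str], actions: List[Action]) -> Set[str]:
    """Fixpoint of facts reachable when delete effects are ignored."""
    reached = set(init)
    changed = True
    while changed:
        changed = False
        for pre, add in actions:
            if pre <= reached and not add <= reached:
                reached |= add
                changed = True
    return reached


def unreachable_goals(
    init: Iterable[str], goal: Iterable[str], actions: List[Action]
) -> List[str]:
    """Goal facts that no sequence of actions can ever make true."""
    return sorted(set(goal) - relaxed_reachable(init, actions))


actions: List[Action] = [
    (frozenset({"truck-at t1 depot", "package-at p1 depot"}),
     frozenset({"truck-holding t1 p1"})),
    (frozenset({"truck-at t1 depot"}),
     frozenset({"truck-at t1 shop"})),
    (frozenset({"truck-at t1 shop", "truck-holding t1 p1"}),
     frozenset({"package-at p1 shop"})),
]
init = {"truck-at t1 depot", "package-at p1 depot"}
goal = {"package-at p1 shop", "package-at p1 airport"}

problems = unreachable_goals(init, goal, actions)
```

A fact that is unreachable under this relaxation is unreachable in the real problem too, so a non-empty result is a cheap signal that the generated domain and problem do not fit together.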

2. Model Editing

LLMs as auxiliary tools rather than fully autonomous generation solutions:

  • Gragera & Pozanco (2023): Investigate limitations of LLMs in fixing unsolvable tasks
  • Patil (2024): LLMs excel at syntax correction but are unreliable on semantic inconsistencies
  • Sikes et al. (2024a): Address semantically equivalent but syntactically different state variable issues
  • Caglar et al. (2024): Evaluate effectiveness of LLMs generating reasonable model edits

3. Model Benchmarks

Evaluating LLM capabilities and quality of generated planning specifications:

3.1 LLMs-as-Planner Benchmarks:

  • Mystery Blocksworld: Obfuscates classical Blocksworld to detect training data leakage
  • ALFWorld & Household: Real household environments using PDDL semantics
  • TravelPlanner & Natural Plan: Travel planning and realistic scheduling benchmarks
  • PlanBench: Systematic evaluation of cost-optimal planning and plan verification
  • ACPBench: Standardized evaluation tasks and metrics covering 13 domains and 22 SOTA models
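
The obfuscation idea behind Mystery Blocksworld can be illustrated with a simple identifier-renaming pass. This is a sketch of the general technique only; the mapping below is made up for the example and is not the benchmark's actual vocabulary.

```python
import re
from typing import Dict


def obfuscate(pddl: str, mapping: Dict[str, str]) -> str:
    """Rename whole PDDL identifiers according to mapping, leaving the structure intact."""
    # Longest names first so that "on-table" is tried before "on".
    names = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"(?<![\w?-])(" + "|".join(re.escape(n) for n in names) + r")(?![\w-])"
    )
    return pattern.sub(lambda m: mapping[m.group(1)], pddl)


# Hypothetical mapping chosen only for this example.
mapping = {
    "pick-up": "zorp",
    "clear": "quib",
    "on-table": "mave",
    "on": "lume",
    "block": "thing",
}

original = (
    "(:action pick-up :parameters (?b - block) "
    ":precondition (and (clear ?b) (on-table ?b)))"
)
obfuscated = obfuscate(original, mapping)
```

The planning problem is logically unchanged after renaming, so a drop in LLM performance on the obfuscated version suggests reliance on memorized surface forms.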

3.2 LLMs-as-Planning-Formalizers Benchmarks:

  • Planetarium: Evaluates PDDL tasks/problems generated by LLMs, emphasizing two key issues:
    • LLMs may produce valid code inconsistent with original NL descriptions
    • NL descriptions in evaluation sets too similar to ground truth
  • Text2World:
    • Automated domain extraction pipeline
    • Multi-criteria metrics: executability, structural similarity, component-level F1 scores
    • Limitation: relies on executability as gating metric
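
A component-level F1 score of the kind listed above can be computed over sets of component names. The sketch below assumes exact name matching between generated and reference components; the benchmark's real matching criterion is not stated in this summary and may differ.

```python
from typing import Iterable


def component_f1(predicted: Iterable[str], reference: Iterable[str]) -> float:
    """F1 between predicted and reference component names (exact match)."""
    predicted_set, reference_set = set(predicted), set(reference)
    if not predicted_set and not reference_set:
        return 1.0
    true_positives = len(predicted_set & reference_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted_set)
    recall = true_positives / len(reference_set)
    return 2 * precision * recall / (precision + recall)


# Illustrative predicate names, loosely following a Logistics-style domain.
generated = ["truck-at", "package-at", "truck-has-space"]
ground_truth = ["truck-at", "package-at", "truck-holding", "plane-at"]

score = component_f1(generated, ground_truth)  # precision 2/3, recall 1/2
```

Here two of three generated predicates are correct (precision 2/3) and two of four reference predicates are recovered (recall 1/2), giving F1 = 4/7.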

Technical Innovations

  1. LLM-Modulo Framework: Ensures correctness through iterative plan refinement with external validators, shifting focus from direct planning to PDDL generation with integrated validators
  2. Intermediate Representations: Using ASP, Python, JSON, and other intermediate representations more amenable to LLMs, then converting to PDDL
  3. Multi-Candidate Generation: Generating multiple candidate domains or specific components (e.g., predicate definitions) to better accommodate ambiguity and uncertainty in user intent
  4. Human-Machine Collaboration: Enhancing model quality through preprocessing steps and human-machine interaction feedback loops
  5. Modular Design: Supporting dynamic integration of types and predicates, enabling more adaptive and fault-tolerant planning systems in later generation stages
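
As an illustration of the intermediate representation idea, the sketch below converts a JSON action description into a PDDL action. The JSON schema (`name`, `parameters`, `preconditions`, `add`, `delete`) is an assumption made for this example, not a format defined by any of the surveyed systems.

```python
import json


def action_to_pddl(spec: dict) -> str:
    """Render one action from a JSON-style dict into PDDL syntax."""
    params = " ".join(f"{name} - {typ}" for name, typ in spec["parameters"].items())
    preconditions = " ".join(f"({p})" for p in spec["preconditions"])
    effects = [f"({e})" for e in spec["add"]]
    effects += [f"(not ({e}))" for e in spec["delete"]]
    return (
        f"(:action {spec['name']}\n"
        f"  :parameters ({params})\n"
        f"  :precondition (and {preconditions})\n"
        f"  :effect (and {' '.join(effects)})\n"
        ")"
    )


# Hypothetical LLM output in the assumed JSON schema.
raw = """
{
  "name": "load_truck",
  "parameters": {"?p": "package", "?t": "truck", "?l": "location"},
  "preconditions": ["truck-at ?t ?l", "package-at ?p ?l"],
  "add": ["truck-holding ?t ?p"],
  "delete": ["package-at ?p ?l"]
}
"""
pddl_action = action_to_pddl(json.loads(raw))
```

Parsing with `json.loads` fails loudly on malformed output, which gives a well-defined point for an error correction loop before any PDDL is produced.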

Experimental Setup

Datasets

As a survey paper, this work covers diverse datasets and domains used across approximately 80 research works:

Classical Planning Domains:

  • Blocksworld
  • Gripper
  • Logistics
  • Floor Tile

Real-World Environments:

  • ALFWorld: Household environment interaction
  • Household: Typical household scenarios
  • TravelPlanner: Travel planning scenarios

Specialized Domains:

  • CVE (Common Vulnerabilities and Exposures): Cybersecurity
  • Emergency Operation Plans (EOPs): Emergency decision-making

Evaluation Metrics

Planning Quality Metrics:

  • Plan correctness
  • Cost optimality
  • Executability

Model Quality Metrics:

  • Structural Similarity: Structural comparison with ground truth
  • Component-level F1 Score: Precision and recall of predicates, actions, and other components
  • Operational Equivalence: Whether reconstructed domain behaves identically to original
  • Semantic Correctness: Whether generated code aligns with original NL description

System Performance Metrics:

  • Generation success rate
  • Number of iterations
  • Human intervention requirements

Comparison Methods

Main method categories covered in the survey:

  1. Direct Generation Methods: Single LLM call generating complete PDDL
  2. Iterative Refinement Methods: Multiple calls and feedback loops
  3. Hybrid Methods: Combining LLM and traditional validation tools
  4. Fine-tuning Methods: Fine-tuning LLMs on specific datasets

Experimental Results

Main Findings

1. Task Modeling is Relatively Straightforward

  • Highly explicit descriptions significantly improve translation accuracy (Liu et al., 2023a)
  • Using few-shot examples and reasoning chains enhances goal specification (Lyu et al., 2023)
  • TIC achieves near-100% accuracy with GPT-3.5 Turbo on the LLM+P planning domains by using intermediate representations

2. Domain Modeling is More Challenging

  • Single-pass generation of fully functional PDDL domains is impractical (Kambhampati et al., 2024)
  • Iterative methods (e.g., LLM+DM's "generate-test-critique") significantly improve quality
  • Contextual examples outperform CoT prompting (Oates et al., 2024)
  • Multi-candidate generation methods better accommodate ambiguity in user intent

3. Hybrid Modeling Complexity

  • Complexity emerges when coordinating domains with corresponding problems
  • Linear pipelines risk cascading errors
  • Preprocessing steps (using external tools like Fast Downward, VAL) improve success rates
  • Human-machine collaboration significantly enhances model quality

4. Model Editing Effectiveness

  • LLMs excel at syntax correction
  • Less reliable on semantic inconsistencies (Patil, 2024)
  • Requires developing post-hoc correction strategies

5. Benchmarking Challenges

  • Training data leakage is a major issue; Hu et al. (2025) report high contamination rates
  • Dynamic benchmark standards needed
  • NL description similarity to ground truth in evaluation sets affects assessment difficulty

Case Analysis

"Action-by-Action" Algorithm Reproduction (Guan et al., 2023) using L2P Library

The paper demonstrates how to reproduce predicate and action generation for the Logistics domain using the L2P library:

Generated Predicate Examples:

(truck-at ?t - truck ?l - location): Truck ?t is currently at location ?l
(package-at ?p - package ?l - location): Package ?p is currently at location ?l
(truck-holding ?t - truck ?p - package): Truck ?t is currently holding package ?p
(plane-at ?a - plane ?l - location): Plane ?a is at location ?l

Generated Action Examples:

load_truck(?p - package, ?t - truck, ?l - location)
  Preconditions: (truck-at ?t ?l) ∧ (package-at ?p ?l) ∧ (truck-has-space ?t)
  Effects: ¬(package-at ?p ?l) ∧ (truck-holding ?t ?p)
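
To make the semantics of the `load_truck` example concrete, the sketch below grounds its preconditions and effects with a variable binding and applies them to a state. It is plain Python written for illustration and does not use the L2P API; the object names `t1`, `p1`, and `depot` are hypothetical.

```python
from typing import Dict, FrozenSet, List

# Lifted atoms copied from the load_truck example above.
PRECONDITIONS = ["truck-at ?t ?l", "package-at ?p ?l", "truck-has-space ?t"]
ADD_EFFECTS = ["truck-holding ?t ?p"]
DELETE_EFFECTS = ["package-at ?p ?l"]


def ground(atoms: List[str], binding: Dict[str, str]) -> FrozenSet[str]:
    """Substitute variables such as ?t with concrete object names."""
    return frozenset(
        " ".join(binding.get(token, token) for token in atom.split())
        for atom in atoms
    )


def apply_load_truck(state: FrozenSet[str], binding: Dict[str, str]) -> FrozenSet[str]:
    """Apply load_truck if its preconditions hold, else raise ValueError."""
    pre = ground(PRECONDITIONS, binding)
    if not pre <= state:
        raise ValueError(f"unsatisfied preconditions: {sorted(pre - state)}")
    return (state - ground(DELETE_EFFECTS, binding)) | ground(ADD_EFFECTS, binding)


state = frozenset({"truck-at t1 depot", "package-at p1 depot", "truck-has-space t1"})
after = apply_load_truck(state, {"?p": "p1", "?t": "t1", "?l": "depot"})
```

After the action, the package is no longer at the depot and the truck holds it, while the truck's location and free space are unchanged because the listed effects do not mention them.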

Experimental Observations

  1. Prompt Sensitivity: LLMs are highly sensitive to prompt design, requiring standardized prompt granularity
  2. Value of Intermediate Representations: Using JSON, Python, and other intermediate representations improves accuracy and consistency
  3. Importance of Validators: Integrating external validation tools (VAL, Fast Downward, etc.) is critical for ensuring quality
  4. Role of Domain Knowledge: Explicit predicate set specifications are crucial for consistent evaluation across different methods
  5. Necessity of Human-Machine Collaboration: Complex domains typically require human-machine interaction to ensure alignment

Related Work

1. Other LLM+Planning Paradigms

LLMs-as-Planners:

  • Direct action sequence generation (Zhang et al., 2024c; Lin et al., 2023)
  • Plan refinement through post-hoc methods (Gundawar et al., 2024)
  • Limitations: Cannot guarantee correctness and optimality

LLMs-as-Heuristics:

  • Enhancing search efficiency through heuristic guidance (Silver et al., 2022; Hirsch et al., 2024)
  • Providing search direction but not directly generating plans

2. Related Surveys

  • Huang et al. (2024c): High-level abstraction of LLM-enhanced planning agents
  • Pallagani et al. (2024): Broader construction beyond traditional AP
  • Zhao et al. (2024): Comprehensive overview of LLM-TAMP applications
  • Li et al. (2024a): Primarily focuses on LLMs-as-Planners, complementary to this work

3. Classical Planning Model Acquisition

  • Traditional approaches rely on manual expert knowledge engineering
  • Learning methods extract models from demonstrations
  • LLM methods discussed in this paper provide new automation pathways

Conclusions and Discussion

Main Conclusions

  1. LLMs-as-Formalizers is a Promising Paradigm: Combining LLMs' natural language understanding with classical planners' structured reasoning capabilities
  2. Task Modeling is Relatively Mature: Existing methods can effectively generate task specifications under explicit descriptions
  3. Domain Modeling Remains Challenging: Requires iterative methods, multi-candidate generation, and external validation
  4. Hybrid Modeling Needs Systematic Approaches: Modular design and error tolerance mechanisms are critical
  5. Benchmarking Requires Continuous Improvement: Data leakage and evaluation standardization are key issues

Limitations

  1. Survey Scope:
    • Primarily focuses on PDDL construction frameworks
    • Technical analysis of each work is brief due to space constraints
    • May miss relevant research from other conferences/journals
  2. Current L2P Library Limitations:
    • Only supports basic PDDL extraction tools for fully observable deterministic planning
    • Does not yet include tools for advanced domains like temporal planning
  3. Method Limitations:
    • Most methods rely on explicit NL-to-PDDL code mapping
    • Limited ability to infer complete specifications from sparse input
    • Semantic error handling remains difficult

Future Directions

Addressing RQ1 (Goal Alignment):

  1. Enhanced Interpretability: Develop interpretable planning systems producing robust, transparent, and correctable outputs
  2. Correction Feedback Loops: Improve mechanisms for handling action precondition errors and execution failures
  3. Human-Machine Collaboration: Ensure alignment through preprocessing steps and human feedback loops
  4. Semantic Correctness Verification: Analyze semantic correctness of generated plans as feedback for refining PDDL specifications

Addressing RQ2 (Description Granularity):

  1. Minimal Description Handling: Develop methods inferring complete PDDL specifications from sparse input
  2. Common Sense Reasoning Integration: Leverage LLMs' common sense to capture potential assumptions and constraints
  3. Standardized Prompting: Establish standardized prompt granularity for initial generation and iterative feedback
  4. Automatic Description Generation: Develop tools automatically generating PDDL descriptions (e.g., Nabizada et al., 2024)

Technical Directions:

  1. Modular Architecture: Support more adaptive systems with dynamic type and predicate integration
  2. Multi-Candidate Strategies: Generate and evaluate multiple candidate models to handle uncertainty
  3. Post-hoc Correction: Identify semantic inconsistencies through automatic metrics or human evaluation
  4. Dynamic Benchmarking: Establish community-driven dynamic benchmark standards preventing data leakage
  5. Extension to Advanced Planning: Extend methods to temporal planning, probabilistic planning, etc.

Application Directions:

  1. Practical Deployment: Test in real scenarios including robotics, game AI, emergency response
  2. Domain Transfer: Improve cross-domain generalization capabilities
  3. Multimodal Integration: Combine visual, linguistic, and other modality information

In-Depth Evaluation

Strengths

  1. Comprehensiveness and Systematicity:
    • First comprehensive survey focusing on LLMs-as-Formalizers paradigm
    • Covers approximately 80 related works with clear classification
    • Provides complete perspective from task modeling to domain modeling to hybrid modeling
  2. High Practical Value:
    • Provides open-source L2P library implementing multiple landmark methods
    • Modular design enables researchers to quickly experiment and compare
    • Includes detailed code examples and usage instructions
  3. Problem-Oriented:
    • Clearly articulates RQ1 and RQ2 core research questions
    • Each subdomain provides "Summary and Future Directions"
    • Provides clear roadmap for future research
  4. Technical Depth:
    • Detailed analysis of various method technical details
    • Compares different prompting strategies, feedback mechanisms, and validation methods
    • Provides PDDL fundamentals and Blocksworld examples
  5. Critical Thinking:
    • Objectively identifies method limitations
    • Discusses key issues including data leakage and evaluation standards
    • Emphasizes distinction between semantic and syntactic correctness

Weaknesses

  1. Limited Empirical Analysis:
    • As a survey paper, lacks systematic experimental comparison under unified framework
    • Different methods use different datasets and metrics, hindering direct comparison
    • No quantitative performance comparison table provided
  2. L2P Library Maturity:
    • Currently reproduces only partial landmark methods
    • Supports only basic PDDL, not temporal or probabilistic features
    • Requires continued community contributions for maintenance
  3. Insufficient Theoretical Analysis:
    • Lacks theoretical explanation for why LLMs fail on certain planning tasks
    • Limited analysis of differences across architectures (GPT vs LLaMA, etc.)
    • Limited theoretical foundation discussion for prompt engineering
  4. Evaluation Methodology:
    • Despite discussing benchmarks, lacks unified evaluation framework
    • Lacks clear definition of "what constitutes good PDDL models"
    • Insufficient detail on human evaluation standards and procedures
  5. Application Scenario Discussion:
    • Limited discussion of practical deployment challenges (computational cost, latency, etc.)
    • Lacks scenario-specific analysis (robotics, games, scheduling, etc.)
    • Insufficient discussion of industrial adoption barriers and solutions

Impact

  1. Academic Contribution:
    • Bridges NLP and AI planning communities
    • Clearly defines LLMs-as-Formalizers paradigm, contrasting with other paradigms
    • Establishes systematic taxonomy and terminology for the field
  2. Practical Value:
    • L2P library lowers research barriers, promoting reproducibility
    • Provides researchers rapid prototyping tools
    • May accelerate research progress in LLM+planning field
  3. Community Building:
    • Integrates dispersed literature providing unified perspective
    • Identifies key challenges and research gaps
    • May inspire new research directions and collaborations
  4. Potential Impact:
    • Likely to become standard reference for the field
    • L2P library has potential to become community standard tool
    • Proposed research questions may guide research for years

Applicable Scenarios

  1. Researchers:
    • Entry guide for LLM+planning field
    • Finding research gaps and future directions
    • Comparing and evaluating different methods
  2. Engineers:
    • Selecting appropriate LLM+planning methods for specific applications
    • Rapid prototyping using L2P library
    • Understanding method tradeoffs and applicable scenarios
  3. Educational Use:
    • Textbook material for LLM+planning courses
    • Rich literature and code resources
    • Clear PDDL introductory examples
  4. Specific Application Domains:
    • Robotics: Generating robot task planning from natural language instructions
    • Game AI: Generating NPC behavior planning models
    • Emergency Response: Generating emergency operation plans from policy documents
    • Logistics: Generating scheduling and routing planning from business descriptions

References

This survey covers approximately 80 related works. Key references include:

Foundational Methods:

  • Liu et al. (2023a): LLM+P - Enhancing LLMs with optimal planning capabilities
  • Guan et al. (2023): LLM+DM - Constructing world models using pretrained LLMs
  • Kambhampati et al. (2024): LLM-Modulo framework - LLMs cannot plan but can help planning

Benchmarks:

  • Valmeekam et al. (2023a): PlanBench - Evaluating LLM planning capabilities
  • Zuo et al. (2024): Planetarium - Evaluating PDDL problem generation
  • Hu et al. (2025): Text2World - Domain generation benchmark

Domain Modeling:

  • Wong et al. (2023): ADA - Action domain acquisition
  • Oswald et al. (2024): Operational equivalence evaluation
  • Zhang et al. (2024b): PROC2PDDL - Text to PDDL

Application Systems:

  • Gestrin et al. (2024): NL2Plan - Domain-independent end-to-end system
  • Kelly et al. (2023): PDDL extraction for narrative planning
  • Ye et al. (2024): MORPHeus - Human-machine collaborative long-horizon planning

Overall Assessment: This is a high-quality, timely, and practical survey paper systematically reviewing the state of research on LLMs as planning formalization tools. The paper features clear classification, in-depth analysis, and particularly valuable open-source L2P library contribution, making it not merely a literature review but an actionable research tool. While there is room for improvement in empirical comparison and theoretical analysis, as the field's first comprehensive survey, it possesses high academic and practical value, likely becoming an important reference for the LLM+automated planning field.