LLMs as Planning Formalizers: A Survey for Leveraging Large Language Models to Construct Automated Planning Models

Tantakoun, Zhu, Muise

Large Language Models (LLMs) excel in various natural language tasks but often struggle with long-horizon planning problems requiring structured reasoning. This limitation has drawn interest in integrating neuro-symbolic approaches within the Automated Planning (AP) and Natural Language Processing (NLP) communities. However, identifying optimal AP deployment frameworks can be daunting and introduces new challenges. This paper aims to provide a timely survey of the current research with an in-depth analysis, positioning LLMs as tools for formalizing and refining planning specifications to support reliable off-the-shelf AP planners. By systematically reviewing the current state of research, we highlight methodologies, and identify critical challenges and future directions, hoping to contribute to the joint research on NLP and Automated Planning.

Basic Information

Paper ID: 2503.18971
Title: LLMs as Planning Formalizers: A Survey for Leveraging Large Language Models to Construct Automated Planning Models
Authors: Marcus Tantakoun, Christian Muise, Xiaodan Zhu (Queen's University)
Classification: cs.AI
Publication Date: March 2025 (arXiv v2: October 25, 2025)
Paper Link: https://arxiv.org/abs/2503.18971v2

Abstract

Large Language Models (LLMs) demonstrate exceptional performance on various natural language tasks but continue to struggle with long-horizon planning problems requiring structured reasoning. This paper provides a timely survey that systematically analyzes the current state of research positioning LLMs as tools for formalizing and refining planning specifications to support reliable off-the-shelf automated planning (AP) systems. Through systematic review of approximately 80 related works, the paper highlights methodologies, identifies key challenges and future directions, and provides an open-source Python library Language-to-Plan (L2P) to facilitate research in this field.

Research Background and Motivation

1. Core Problem

Despite LLMs' superior performance on natural language processing tasks, they perform poorly on long-horizon planning and reasoning tasks, frequently producing unreliable plans. Using LLMs directly as planners (LLM-as-Planner) cannot guarantee the correctness, optimality, and reliability of outputs.

2. Problem Significance

Nature of Planning: Planning is a crucial component of System II cognition, requiring structured reasoning, whereas LLMs excel at System I tasks
Practical Application Bottleneck: Extracting planning models has long been a major obstacle to widespread application of planning technology
Reliability Requirements: Practical applications require verifiable, interpretable, and robust planning solutions

3. Limitations of Existing Approaches

Direct Planning Methods: When LLMs directly generate action sequences, performance degrades with iterative feedback
Lack of Structured Guarantees: LLMs cannot provide correctness guarantees like classical planning systems
Long-term Dependency Issues: As scale increases, LLMs frequently fail to consider action effects and preconditions

4. Research Motivation

This paper proposes the LLMs-as-Formalizers paradigm: leveraging LLMs' strengths (extracting, interpreting, and refining planning model specifications from natural language) combined with classical automated planning systems' strengths (structured representations, logic, and search methods) to construct complementary neuro-symbolic frameworks.

Core Contributions

Systematic Taxonomy: Proposes the first comprehensive classification system for LLM-driven automated planning model construction, including:
- Model Generation: task modeling, domain modeling, hybrid modeling
- Model Editing: code refinement and error correction
- Model Benchmarks: evaluation frameworks and datasets
Technical Methods Summary: Systematically reviews shared and innovative technical methods for integrating LLMs into AI planning frameworks and their limitations
Research Question Framework: Proposes two core research questions (RQ):
- RQ1: How can LLMs accurately align with human objectives to ensure planning model specifications correctly represent desired expectations and goals?
- RQ2: To what extent and granularity can natural language instructions be effectively converted into accurate planning model definitions?
Open-Source Toolkit: Provides Language-to-Plan (L2P), an open-source Python library that implements methods from landmark papers covered in the survey, supporting:
- Comprehensive PDDL extraction and refinement tool suite
- Modular design supporting flexible prompting styles and custom pipelines
- Fully autonomous end-to-end pipeline capabilities
Future Direction Guidance: Identifies key challenges and outlines future research directions for the field

Methodology Details

Task Definition

This survey focuses on the LLMs-as-Formalizers paradigm, using LLMs to construct automated planning (AP) model specifications (primarily in PDDL format), which are then solved by domain-independent planners. This contrasts with:

LLMs-as-Planners: LLMs directly generate action sequences
LLMs-as-Heuristics: LLMs enhance search efficiency through heuristic guidance
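
The division of labor in the LLMs-as-Formalizers paradigm can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the survey's or L2P's implementation: `formalize` is a hypothetical stand-in for an LLM call and simply returns a hard-coded ground model, and the breadth-first search stands in for an off-the-shelf planner.

```python
from collections import deque
from typing import NamedTuple


class Action(NamedTuple):
    name: str
    pre: frozenset      # facts that must hold before the action
    add: frozenset      # facts the action makes true
    delete: frozenset   # facts the action makes false


def formalize(description):
    """Hypothetical stand-in for the LLM: returns a ground planning model.

    A real formalizer would derive the model from `description`;
    here the model is hard-coded for illustration.
    """
    actions = [
        Action("move-a-b", frozenset({"at-a"}), frozenset({"at-b"}), frozenset({"at-a"})),
        Action("move-b-c", frozenset({"at-b"}), frozenset({"at-c"}), frozenset({"at-b"})),
    ]
    return actions, frozenset({"at-a"}), frozenset({"at-c"})


def plan(actions, init, goal):
    """Breadth-first forward search; stands in for a symbolic planner."""
    frontier = deque([(init, [])])
    seen = {init}
    while frontier:
        state, path = frontier.popleft()
        if goal <= state:
            return path
        for act in actions:
            if act.pre <= state:
                nxt = (state - act.delete) | act.add
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, path + [act.name]))
    return None  # no plan exists for this model


actions, init, goal = formalize("Drive from A to C via B.")
print(plan(actions, init, goal))  # ['move-a-b', 'move-b-c']
```

The point of the split is that any guarantee about the plan comes from the search over the model, so the LLM only has to get the model right.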

Core Framework Classification

1. Model Generation

Extracting and formalizing planning specifications from natural language input, divided into three subcategories:

1.1 Task Modeling

Goal Specification Methods:
- Few-shot prompting (Collins et al., 2022; Grover & Mohan, 2024)
- Chain-of-Thought (CoT) prompting (Lyu et al., 2023)
- Handling varying degrees of ambiguity (Xie et al., 2023)
Complete Task Specification:
- Open-loop Systems: LLM+P uses contextual examples to generate complete PDDL problem files
- Closed-loop Systems: Auto-GPT+P generates initial states based on visual perception with automatic error correction loops
- Multi-agent Collaboration: DaTAPlan, PlanCollabNL, TwoStep, LaMMA-P
Alternative Representations:
- Geometric representations for task and motion planning
- Temporal logic (TSL, STL, LTL)
- Python function definitions for search space
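
To make concrete what "generating a complete PDDL problem file" involves, the sketch below renders a problem from structured fields. It is an illustration only: `render_problem` and its argument layout are hypothetical choices made here, not an interface from LLM+P or L2P.

```python
def render_problem(name, domain, objects, init, goal):
    """Render a PDDL problem string.

    objects: dict mapping a type name to a list of object names
    init, goal: lists of ground atoms as tuples, e.g. ("on", "a", "b")
    """
    def atom(parts):
        return "(" + " ".join(parts) + ")"

    obj_lines = [f"    {' '.join(names)} - {typ}" for typ, names in objects.items()]
    init_lines = [f"    {atom(a)}" for a in init]
    goal_atoms = " ".join(atom(a) for a in goal)
    return "\n".join([
        f"(define (problem {name})",
        f"  (:domain {domain})",
        "  (:objects",
        *obj_lines,
        "  )",
        "  (:init",
        *init_lines,
        "  )",
        f"  (:goal (and {goal_atoms}))",
        ")",
    ])


# Illustrative two-block Blocksworld task (names chosen for this example).
text = render_problem(
    name="stack-two",
    domain="blocksworld",
    objects={"block": ["a", "b"]},
    init=[("ontable", "a"), ("ontable", "b"), ("clear", "a"), ("clear", "b"), ("handempty",)],
    goal=[("on", "a", "b")],
)
print(text)
```

Having the model emit structured fields and rendering the PDDL deterministically is one way to keep syntax errors out of the problem file; whether that helps with semantic errors is a separate question.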

1.2 Domain Modeling

Single-Query Methods:
- CLLaMP: Extracting PDDL action models from CVE descriptions
- PROC2PDDL: Zone of Proximal Development prompt design
- Candidate filtering methods (Huang et al., 2024b; Athalye et al., 2024)
Iterative Generation Methods:
- LLM+DM: Adopts "generate-test-critique" approach, incrementally constructing domain components through multiple LLM calls
- LLM+AL: Generating BC+ action language
- LAMP: Algorithm family for learning abstract PDDL domain models
Closed-Loop Frameworks:
- ADA: Generates candidate symbolic task decompositions, then iteratively prompts for definitions of any undefined actions
- COWP: Handles unexpected situations in open-world planning
- LASP: Identifies potential errors from environment observations

1.3 Hybrid Modeling

Complete model generation covering both the PDDL domain and the problem specification:

Foundational Methods: Kelly et al. (2023) extracts narrative planning from input stories, iteratively handling planner error messages
Intermediate Representation Methods:
- NL2Plan: First domain-independent offline end-to-end NL planning system
- JSON token generation, consistency checking, and error correction loops
- Reachability and dependency analysis
Practical Applications:
- MORPHeus: Human-machine collaborative long-horizon planning with anomaly detection
- InterPret: Learning PDDL predicates through interactive user language feedback
- AgentGen: Synthesizing diverse PDDL tasks using LLMs for training

2. Model Editing

LLMs as auxiliary tools rather than fully autonomous generation solutions:

Gragera & Pozanco (2023): Investigate limitations of LLMs in fixing unsolvable tasks
Patil (2024): LLMs excel at syntax correction but are unreliable on semantic inconsistencies
Sikes et al. (2024a): Address semantically equivalent but syntactically different state variable issues
Caglar et al. (2024): Evaluate effectiveness of LLMs generating reasonable model edits

3. Model Benchmarks

Evaluating LLM capabilities and quality of generated planning specifications:

3.1 LLMs-as-Planner Benchmarks:

Mystery Blocksworld: Obfuscates classical Blocksworld to detect training data leakage
ALFWorld & Household: Realistic household environments using PDDL semantics
TravelPlanner & Natural Plan: Travel planning and realistic scheduling benchmarks
PlanBench: Systematic evaluation of cost-optimal planning and plan verification
ACPBench: Standardized evaluation tasks and metrics covering 13 domains and 22 SOTA models
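
The obfuscation idea behind Mystery Blocksworld can be illustrated with a simple token-level renaming of a PDDL fragment. This is a sketch of the general technique only; the replacement names below are illustrative and are not claimed to match the benchmark's actual vocabulary.

```python
import re


def obfuscate(pddl_text, mapping):
    """Rename whole identifiers in PDDL text according to `mapping`."""
    # PDDL names may contain hyphens and underscores, so match full tokens
    # instead of relying on \b, which would split "pick-up" at the hyphen.
    token = re.compile(r"[A-Za-z][A-Za-z0-9_-]*")
    return token.sub(lambda m: mapping.get(m.group(0), m.group(0)), pddl_text)


original = (
    "(:action pick-up :parameters (?x - block) "
    ":precondition (and (clear ?x) (ontable ?x) (handempty)))"
)
# Illustrative mapping: the structure is preserved, the vocabulary is not.
mapping = {"pick-up": "attack", "clear": "province", "ontable": "planet", "handempty": "harmony"}

print(obfuscate(original, mapping))
```

Because the renamed domain is structurally identical to the original, a drop in performance on it suggests the model was relying on familiar names rather than on the action semantics.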

3.2 LLMs-as-Planning-Formalizers Benchmarks:

Planetarium: Evaluates PDDL tasks/problems generated by LLMs, emphasizing two key issues:
- LLMs may produce valid code inconsistent with original NL descriptions
- NL descriptions in evaluation sets too similar to ground truth
Text2World:
- Automated domain extraction pipeline
- Multi-criteria metrics: executability, structural similarity, component-level F1 scores
- Limitation: relies on executability as gating metric
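
Both benchmarks point at the same gap: output that parses, or even executes, may still not mean what the description says. One way to check meaning rather than syntax is to compare a generated goal against a reference up to a renaming of objects. The brute-force sketch below is a simplification introduced here for illustration; it is not the algorithm used by Planetarium or Text2World, and it is exponential in the number of objects.

```python
from itertools import permutations


def objects_of(atoms):
    """Collect the object names appearing as arguments in a set of atoms."""
    return sorted({arg for _, *args in atoms for arg in args})


def equivalent(atoms_a, atoms_b):
    """True if some bijective renaming of objects maps atoms_a onto atoms_b."""
    atoms_a, atoms_b = set(atoms_a), set(atoms_b)
    objs_a, objs_b = objects_of(atoms_a), objects_of(atoms_b)
    if len(atoms_a) != len(atoms_b) or len(objs_a) != len(objs_b):
        return False
    for perm in permutations(objs_b):
        rename = dict(zip(objs_a, perm))
        renamed = {(pred, *(rename[x] for x in args)) for pred, *args in atoms_a}
        if renamed == atoms_b:
            return True
    return False


reference = [("on", "a", "b"), ("ontable", "b")]
candidate = [("on", "x", "y"), ("ontable", "y")]   # same structure, new names
mismatch = [("on", "x", "y"), ("ontable", "x")]    # well-formed but different meaning

print(equivalent(reference, candidate))  # True
print(equivalent(reference, mismatch))   # False
```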

Technical Innovations

LLM-Modulo Framework: Ensures correctness through iterative plan refinement with external validators, shifting focus from direct planning to PDDL generation with integrated validators
Intermediate Representations: Using ASP, Python, JSON, and other intermediate representations more amenable to LLMs, then converting to PDDL
Multi-Candidate Generation: Generating multiple candidate domains or specific components (e.g., predicate definitions) to better accommodate ambiguity and uncertainty in user intent
Human-Machine Collaboration: Enhancing model quality through preprocessing steps and human-machine interaction feedback loops
Modular Design: Supporting dynamic integration of types and predicates, enabling more adaptive and fault-tolerant planning systems in later generation stages
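
The intermediate-representation idea can be sketched as follows: the model emits JSON, and ordinary code checks it and renders PDDL. The JSON schema, the function names, and the single consistency check are all assumptions made for this example rather than a format taken from any surveyed system.

```python
import json


def render_literal(lit):
    """Render a nested-list literal such as ["not", ["p", "?x"]] as PDDL."""
    if lit[0] == "not":
        return f"(not {render_literal(lit[1])})"
    return "(" + " ".join(lit) + ")"


def predicate_name(lit):
    """Return the predicate name, looking inside a "not" wrapper if present."""
    return lit[1][0] if lit[0] == "not" else lit[0]


def action_from_json(payload, declared):
    """Parse a JSON action, check its predicates, and render a PDDL action."""
    spec = json.loads(payload)
    for lit in spec["preconditions"] + spec["effects"]:
        name = predicate_name(lit)
        if name not in declared:
            raise ValueError(f"undeclared predicate: {name}")
    params = " ".join(f"{var} - {typ}" for var, typ in spec["parameters"])
    pre = " ".join(render_literal(lit) for lit in spec["preconditions"])
    eff = " ".join(render_literal(lit) for lit in spec["effects"])
    return (
        f"(:action {spec['name']}\n"
        f"  :parameters ({params})\n"
        f"  :precondition (and {pre})\n"
        f"  :effect (and {eff})\n"
        f")"
    )


# Hypothetical JSON that an LLM might be asked to produce.
payload = json.dumps({
    "name": "load_truck",
    "parameters": [["?p", "package"], ["?t", "truck"], ["?l", "location"]],
    "preconditions": [["truck-at", "?t", "?l"], ["package-at", "?p", "?l"]],
    "effects": [["not", ["package-at", "?p", "?l"]], ["truck-holding", "?t", "?p"]],
})
declared = {"truck-at", "package-at", "truck-holding"}
print(action_from_json(payload, declared))
```

The error raised for an undeclared predicate is the kind of message that could be fed back to the model in a correction loop.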

Experimental Setup

Datasets

As a survey paper, this work covers diverse datasets and domains used across approximately 80 research works:

Classical Planning Domains:

Blocksworld
Gripper
Logistics
Floor Tile

Real-World Environments:

ALFWorld: Household environment interaction
Household: Typical household scenarios
TravelPlanner: Travel planning scenarios

Specialized Domains:

CVE (Common Vulnerabilities and Exposures): Cybersecurity
Emergency Operation Plans (EOPs): Emergency decision-making

Evaluation Metrics

Planning Quality Metrics:

Plan correctness
Cost optimality
Executability

Model Quality Metrics:

Structural Similarity: Structural comparison with ground truth
Component-level F1 Score: Precision and recall of predicates, actions, and other components
Operational Equivalence: Whether reconstructed domain behaves identically to original
Semantic Correctness: Whether generated code aligns with original NL description
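
As a concrete reading of the component-level F1 score, the sketch below compares a predicted set of predicate names with a reference set. It assumes exact name matching, which is a simplification chosen here; a benchmark may match components differently.

```python
def component_f1(predicted, reference):
    """F1 between two collections of component names, using exact matching."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0  # also covers empty inputs and avoids division by zero
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


# Reference names follow the Logistics predicates quoted in this summary;
# "in-truck" is a made-up wrong prediction for illustration.
reference = {"truck-at", "package-at", "truck-holding", "plane-at"}
predicted = {"truck-at", "package-at", "in-truck"}

# precision = 2/3, recall = 2/4, so F1 = 4/7
print(round(component_f1(predicted, reference), 4))  # 0.5714
```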

System Performance Metrics:

Generation success rate
Number of iterations
Human intervention requirements

Comparison Methods

Main method categories covered in the survey:

Direct Generation Methods: Single LLM call generating complete PDDL
Iterative Refinement Methods: Multiple calls and feedback loops
Hybrid Methods: Combining LLM and traditional validation tools
Fine-tuning Methods: Fine-tuning LLMs on specific datasets

Experimental Results

Main Findings

1. Task Modeling is Relatively Straightforward

Highly explicit descriptions significantly improve translation accuracy (Liu et al., 2023a)
Using few-shot examples and reasoning chains enhances goal specification (Lyu et al., 2023)
TIC reaches near-100% accuracy with GPT-3.5 Turbo on the LLM+P planning domains by using intermediate representations

2. Domain Modeling is More Challenging

Single-pass generation of fully functional PDDL domains is impractical (Kambhampati et al., 2024)
Iterative methods (e.g., LLM+DM's "generate-test-critique") significantly improve quality
Contextual examples outperform CoT prompting (Oates et al., 2024)
Multi-candidate generation methods better accommodate ambiguity in user intent

3. Hybrid Modeling Complexity

Complexity emerges when coordinating domains with corresponding problems
Linear pipelines risk cascading errors
Preprocessing steps (using external tools such as Fast Downward and VAL) improve success rates
Human-machine collaboration significantly enhances model quality

4. Model Editing Effectiveness

LLMs excel at syntax correction
Less reliable on semantic inconsistencies (Patil, 2024)
Requires developing post-hoc correction strategies

5. Benchmarking Challenges

Training data leakage is a major issue (Hu et al., 2025 report high contamination rates)
Dynamic benchmark standards needed
NL description similarity to ground truth in evaluation sets affects assessment difficulty

Case Analysis

"Action-by-Action" Algorithm Reproduction (Guan et al., 2023) using L2P Library

The paper demonstrates how to reproduce predicate and action generation for the Logistics domain using the L2P library:

Generated Predicate Examples:

(truck-at ?t - truck ?l - location): Truck ?t is currently at location ?l
(package-at ?p - package ?l - location): Package ?p is currently at location ?l
(truck-holding ?t - truck ?p - package): Truck ?t is currently holding package ?p
(plane-at ?a - plane ?l - location): Plane ?a is at location ?l

Generated Action Examples:

load_truck(?p - package, ?t - truck, ?l - location)
  Preconditions: (truck-at ?t ?l) ∧ (package-at ?p ?l) ∧ (truck-has-space ?t)
  Effects: ¬(package-at ?p ?l) ∧ (truck-holding ?t ?p)
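
To show what these preconditions and effects mean operationally, the sketch below applies a grounded `load_truck` to a state represented as a set of facts. The object names (`pkg1`, `truck1`, `depot`) and the `apply_action` helper are hypothetical; only the predicates and the action's structure come from the example above.

```python
def apply_action(state, pre, add, delete):
    """Apply a ground STRIPS-style action; raise if a precondition is unmet."""
    missing = pre - state
    if missing:
        raise ValueError(f"unsatisfied preconditions: {sorted(missing)}")
    return (state - delete) | add


# load_truck grounded with ?p = pkg1, ?t = truck1, ?l = depot (assumed names)
pre = {
    ("truck-at", "truck1", "depot"),
    ("package-at", "pkg1", "depot"),
    ("truck-has-space", "truck1"),
}
add = {("truck-holding", "truck1", "pkg1")}
delete = {("package-at", "pkg1", "depot")}

before = set(pre)
after = apply_action(before, pre, add, delete)
print(sorted(after))
```

Applying the action a second time from `after` fails, because the package is no longer at the location. Note that, as written in the example, the effects leave `truck-has-space` unchanged.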

Experimental Observations

Prompt Sensitivity: LLMs are highly sensitive to prompt design, requiring standardized prompt granularity
Value of Intermediate Representations: Using JSON, Python, and other intermediate representations improves accuracy and consistency
Importance of Validators: Integrating external validation tools (VAL, Fast Downward, etc.) is critical for ensuring quality
Role of Domain Knowledge: Explicit predicate set specifications are crucial for consistent evaluation across different methods
Necessity of Human-Machine Collaboration: Complex domains typically require human-machine interaction to ensure alignment

Related Work

1. Other LLM+Planning Paradigms

LLMs-as-Planners:

Direct action sequence generation (Zhang et al., 2024c; Lin et al., 2023)
Plan refinement through post-hoc methods (Gundawar et al., 2024)
Limitations: Cannot guarantee correctness and optimality

LLMs-as-Heuristics:

Enhancing search efficiency through heuristic guidance (Silver et al., 2022; Hirsch et al., 2024)
Providing search direction but not directly generating plans

2. Related Surveys

Huang et al. (2024c): High-level abstraction of LLM-enhanced planning agents
Pallagani et al. (2024): Broader construction beyond traditional AP
Zhao et al. (2024): Comprehensive overview of LLM-TAMP applications
Li et al. (2024a): Primarily focuses on LLMs-as-Planners, complementary to this work

3. Classical Planning Model Acquisition

Traditional approaches rely on manual expert knowledge engineering
Learning methods extract models from demonstrations
LLM methods discussed in this paper provide new automation pathways

Conclusions and Discussion

Main Conclusions

LLMs-as-Formalizers is a Promising Paradigm: Combining LLMs' natural language understanding with classical planners' structured reasoning capabilities
Task Modeling is Relatively Mature: Existing methods can effectively generate task specifications under explicit descriptions
Domain Modeling Remains Challenging: Requires iterative methods, multi-candidate generation, and external validation
Hybrid Modeling Needs Systematic Approaches: Modular design and error tolerance mechanisms are critical
Benchmarking Requires Continuous Improvement: Data leakage and evaluation standardization are key issues

Limitations

Survey Scope:
- Primarily focuses on PDDL construction frameworks
- Technical analysis of each work is brief due to space constraints
- May miss relevant research from other conferences/journals
Current L2P Library Limitations:
- Only supports basic PDDL extraction tools for fully observable deterministic planning
- Does not yet include tools for advanced domains like temporal planning
Method Limitations:
- Most methods rely on explicit NL-to-PDDL code mapping
- Limited ability to infer complete specifications from sparse input
- Semantic error handling remains difficult

Future Directions

Addressing RQ1 (Goal Alignment):

Enhanced Interpretability: Develop interpretable planning systems producing robust, transparent, and correctable outputs
Correction Feedback Loops: Improve mechanisms for handling action precondition errors and execution failures
Human-Machine Collaboration: Ensure alignment through preprocessing steps and human feedback loops
Semantic Correctness Verification: Analyze semantic correctness of generated plans as feedback for refining PDDL specifications

Addressing RQ2 (Description Granularity):

Minimal Description Handling: Develop methods inferring complete PDDL specifications from sparse input
Common Sense Reasoning Integration: Leverage LLMs' common sense to capture potential assumptions and constraints
Standardized Prompting: Establish standardized prompt granularity for initial generation and iterative feedback
Automatic Description Generation: Develop tools automatically generating PDDL descriptions (e.g., Nabizada et al., 2024)

Technical Directions:

Modular Architecture: Support more adaptive systems with dynamic type and predicate integration
Multi-Candidate Strategies: Generate and evaluate multiple candidate models to handle uncertainty
Post-hoc Correction: Identify semantic inconsistencies through automatic metrics or human evaluation
Dynamic Benchmarking: Establish community-driven dynamic benchmark standards preventing data leakage
Extension to Advanced Planning: Extend methods to temporal planning, probabilistic planning, etc.

Application Directions:

Practical Deployment: Test in real scenarios including robotics, game AI, emergency response
Domain Transfer: Improve cross-domain generalization capabilities
Multimodal Integration: Combine visual, linguistic, and other modality information

In-Depth Evaluation

Strengths

Comprehensiveness and Systematicity:
- First comprehensive survey focusing on LLMs-as-Formalizers paradigm
- Covers approximately 80 related works with clear classification
- Provides complete perspective from task modeling to domain modeling to hybrid modeling
High Practical Value:
- Provides open-source L2P library implementing multiple landmark methods
- Modular design enables researchers to quickly experiment and compare
- Includes detailed code examples and usage instructions
Problem-Oriented:
- Clearly articulates RQ1 and RQ2 core research questions
- Each subdomain provides "Summary and Future Directions"
- Provides clear roadmap for future research
Technical Depth:
- Detailed analysis of various method technical details
- Compares different prompting strategies, feedback mechanisms, and validation methods
- Provides PDDL fundamentals and Blocksworld examples
Critical Thinking:
- Objectively identifies method limitations
- Discusses key issues including data leakage and evaluation standards
- Emphasizes distinction between semantic and syntactic correctness

Weaknesses

Limited Empirical Analysis:
- As a survey paper, lacks systematic experimental comparison under unified framework
- Different methods use different datasets and metrics, hindering direct comparison
- No quantitative performance comparison table provided
L2P Library Maturity:
- Currently reproduces only partial landmark methods
- Supports only basic PDDL, not temporal or probabilistic features
- Requires continued community contributions for maintenance
Insufficient Theoretical Analysis:
- Lacks theoretical explanation for why LLMs fail on certain planning tasks
- Limited analysis of differences across architectures (GPT vs LLaMA, etc.)
- Limited theoretical foundation discussion for prompt engineering
Evaluation Methodology:
- Despite discussing benchmarks, lacks unified evaluation framework
- Lacks clear definition of "what constitutes good PDDL models"
- Insufficient detail on human evaluation standards and procedures
Application Scenario Discussion:
- Limited discussion of practical deployment challenges (computational cost, latency, etc.)
- Lacks scenario-specific analysis (robotics, games, scheduling, etc.)
- Insufficient discussion of industrial adoption barriers and solutions

Impact

Academic Contribution:
- Bridges NLP and AI planning communities
- Clearly defines LLMs-as-Formalizers paradigm, contrasting with other paradigms
- Establishes systematic taxonomy and terminology for the field
Practical Value:
- L2P library lowers research barriers, promoting reproducibility
- Provides researchers rapid prototyping tools
- May accelerate research progress in LLM+planning field
Community Building:
- Integrates dispersed literature providing unified perspective
- Identifies key challenges and research gaps
- May inspire new research directions and collaborations
Potential Impact:
- Likely to become standard reference for the field
- L2P library has potential to become community standard tool
- Proposed research questions may guide research for years

Applicable Scenarios

Researchers:
- Entry guide for LLM+planning field
- Finding research gaps and future directions
- Comparing and evaluating different methods
Engineers:
- Selecting appropriate LLM+planning methods for specific applications
- Rapid prototyping using L2P library
- Understanding method tradeoffs and applicable scenarios
Educational Use:
- Textbook material for LLM+planning courses
- Rich literature and code resources
- Clear PDDL introductory examples
Specific Application Domains:
- Robotics: Generating robot task planning from natural language instructions
- Game AI: Generating NPC behavior planning models
- Emergency Response: Generating emergency operation plans from policy documents
- Logistics: Generating scheduling and routing planning from business descriptions

References

This survey covers approximately 80 related works. Key references include:

Foundational Methods:

Liu et al. (2023a): LLM+P - Enhancing LLMs with optimal planning capabilities
Guan et al. (2023): LLM+DM - Constructing world models using pretrained LLMs
Kambhampati et al. (2024): LLM-Modulo framework - LLMs cannot plan but can help planning

Benchmarks:

Valmeekam et al. (2023a): PlanBench - Evaluating LLM planning capabilities
Zuo et al. (2024): Planetarium - Evaluating PDDL problem generation
Hu et al. (2025): Text2World - Domain generation benchmark

Domain Modeling:

Wong et al. (2023): ADA - Action domain acquisition
Oswald et al. (2024): Operational equivalence evaluation
Zhang et al. (2024b): PROC2PDDL - Text to PDDL

Application Systems:

Gestrin et al. (2024): NL2Plan - Domain-independent end-to-end system
Kelly et al. (2023): PDDL extraction for narrative planning
Ye et al. (2024): MORPHeus - Human-machine collaborative long-horizon planning

Overall Assessment: This is a high-quality, timely, and practical survey paper systematically reviewing the state of research on LLMs as planning formalization tools. The paper features clear classification, in-depth analysis, and particularly valuable open-source L2P library contribution, making it not merely a literature review but an actionable research tool. While there is room for improvement in empirical comparison and theoretical analysis, as the field's first comprehensive survey, it possesses high academic and practical value, likely becoming an important reference for the LLM+automated planning field.