LLMs as Planning Formalizers: A Survey for Leveraging Large Language Models to Construct Automated Planning Models
Tantakoun, Zhu, Muise
Large Language Models (LLMs) excel in various natural language tasks but often struggle with long-horizon planning problems requiring structured reasoning. This limitation has drawn interest in integrating neuro-symbolic approaches within the Automated Planning (AP) and Natural Language Processing (NLP) communities. However, identifying optimal AP deployment frameworks can be daunting and introduces new challenges. This paper aims to provide a timely survey of the current research with an in-depth analysis, positioning LLMs as tools for formalizing and refining planning specifications to support reliable off-the-shelf AP planners. By systematically reviewing the current state of research, we highlight methodologies and identify critical challenges and future directions, hoping to contribute to the joint research on NLP and Automated Planning.
Large Language Models (LLMs) perform very well on many natural language tasks but continue to struggle with long-horizon planning problems that require structured reasoning. This paper surveys research that positions LLMs as tools for formalizing and refining planning specifications, so that reliable off-the-shelf automated planning (AP) systems can solve them. Through a systematic review of approximately 80 related works, the paper highlights methodologies, identifies key challenges and future directions, and provides an open-source Python library, Language-to-Plan (L2P), to facilitate research in this field.
Despite their strong performance on natural language processing tasks, LLMs perform poorly on long-horizon planning and reasoning tasks and frequently produce unreliable plans. Using LLMs directly as planners (LLM-as-Planner) therefore cannot guarantee the correctness, optimality, or reliability of the resulting plans.
This paper proposes the LLMs-as-Formalizers paradigm: leveraging LLMs' strengths (extracting, interpreting, and refining planning model specifications from natural language) combined with classical automated planning systems' strengths (structured representations, logic, and search methods) to construct complementary neuro-symbolic frameworks.
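The division of labor described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `formalize` is a hard-coded stand-in for the LLM step, the STRIPS-style `Action` class is a simplified substitute for a PDDL model, and the breadth-first search stands in for an off-the-shelf planner. It is not the survey's or L2P's implementation.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """A ground STRIPS-style action (simplified stand-in for a PDDL action)."""
    name: str
    preconditions: frozenset
    add_effects: frozenset
    del_effects: frozenset


def formalize(description: str):
    """Hypothetical stand-in for the LLM step.

    A real formalizer would prompt an LLM with `description`; here the
    resulting model is hard-coded so the example runs offline.
    """
    actions = [
        Action("pick-up-a",
               frozenset({"on-table-a", "hand-empty"}),
               frozenset({"holding-a"}),
               frozenset({"on-table-a", "hand-empty"})),
        Action("stack-a-on-b",
               frozenset({"holding-a", "clear-b"}),
               frozenset({"on-a-b", "hand-empty"}),
               frozenset({"holding-a", "clear-b"})),
    ]
    initial = frozenset({"on-table-a", "clear-b", "hand-empty"})
    goal = frozenset({"on-a-b"})
    return actions, initial, goal


def plan(actions, initial, goal):
    """Symbolic step: breadth-first search, returning a shortest plan or None."""
    frontier = deque([(initial, [])])
    visited = {initial}
    while frontier:
        state, steps = frontier.popleft()
        if goal <= state:  # every goal fact holds in this state
            return steps
        for action in actions:
            if action.preconditions <= state:
                successor = (state - action.del_effects) | action.add_effects
                if successor not in visited:
                    visited.add(successor)
                    frontier.append((successor, steps + [action.name]))
    return None


actions, initial, goal = formalize("Put block A on block B.")
result = plan(actions, initial, goal)
print(result)
```

The point of the split is that any guarantee about the plan comes from the search procedure, not from the language model: if the model is faithful to the description, the planner's output is sound with respect to it.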
Systematic Taxonomy: Proposes the first comprehensive classification system for LLM-driven automated planning model construction, including:
Model Generation: task modeling, domain modeling, hybrid modeling
Model Editing: code refinement and error correction
Model Benchmarks: evaluation frameworks and datasets
Technical Methods Summary: Systematically reviews the shared and novel technical methods used to integrate LLMs into AI planning frameworks, along with the limitations of those methods
Research Question Framework: Proposes two core research questions (RQ):
RQ1: How can LLMs accurately align with human objectives to ensure planning model specifications correctly represent desired expectations and goals?
RQ2: To what extent and granularity can natural language instructions be effectively converted into accurate planning model definitions?
Open-Source Toolkit: Provides Language-to-Plan (L2P), an open-source Python library that implements methods from the landmark papers covered in the survey, supporting:
Comprehensive PDDL extraction and refinement tool suite
Modular design supporting flexible prompting styles and custom pipelines
Fully autonomous end-to-end pipeline capabilities
Future Direction Guidance: Identifies key challenges and outlines future research directions for the field
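The Model Editing category in the taxonomy above (code refinement and error correction) is commonly realized as a check-and-repair loop. The sketch below assumes two cheap syntactic checks and a hypothetical `repair` callback; `stub_repair` is a stand-in for an LLM call that would receive the error messages as feedback. None of these names come from L2P or from the surveyed papers.

```python
def check_pddl(domain: str) -> list:
    """Return error messages from two cheap syntactic checks."""
    errors = []
    depth = 0
    for ch in domain:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a closing parenthesis with no matching opener
                break
    if depth != 0:
        errors.append("unbalanced parentheses")
    for section in ("(define", "(:predicates", "(:action"):
        if section not in domain:
            errors.append(f"missing section {section}")
    return errors


def refine(draft: str, repair, max_rounds: int = 3):
    """Ask `repair` for a new draft until the checks pass or the budget runs out."""
    for _ in range(max_rounds):
        errors = check_pddl(draft)
        if not errors:
            return draft, []
        draft = repair(draft, errors)
    return draft, check_pddl(draft)


def stub_repair(draft: str, errors: list) -> str:
    """Hypothetical stand-in for an LLM repair call; only closes open parentheses."""
    if "unbalanced parentheses" in errors:
        missing = draft.count("(") - draft.count(")")
        return draft + ")" * max(missing, 0)
    return draft


# A draft domain whose outermost (define ...) form is never closed.
broken = (
    "(define (domain blocks) (:predicates (clear ?x)) "
    "(:action pick-up :parameters (?x) "
    ":precondition (clear ?x) :effect (not (clear ?x)))"
)
fixed, remaining = refine(broken, stub_repair)
print(remaining)
```

In practice the checker would more likely be a PDDL parser or plan validator, which can report semantic errors as well as syntactic ones; the loop structure stays the same.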
This survey focuses on the LLMs-as-Formalizers paradigm: LLMs construct automated planning (AP) model specifications, primarily in PDDL, which are then solved by domain-independent planners. Related approaches and recurring techniques include:
LLM-Modulo Framework: Ensures correctness through iterative plan refinement with external validators, shifting focus from direct planning to PDDL generation with integrated validators
Intermediate Representations: Using ASP, Python, JSON, and other intermediate representations more amenable to LLMs, then converting to PDDL
Multi-Candidate Generation: Generating multiple candidate domains or specific components (e.g., predicate definitions) to better accommodate ambiguity and uncertainty in user intent
Human-Machine Collaboration: Enhancing model quality through preprocessing steps and human-machine interaction feedback loops
Modular Design: Supporting dynamic integration of types and predicates, enabling more adaptive and fault-tolerant planning systems in later generation stages
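The Intermediate Representations item above can be made concrete with a small sketch: the LLM is asked for JSON, which is easy to parse and validate, and a deterministic function renders it as a PDDL action. The JSON schema and the `action_to_pddl` helper are assumptions invented for this example, not a format used by any specific surveyed system.

```python
import json


def action_to_pddl(spec: dict) -> str:
    """Render a JSON-style action description as a PDDL action block."""
    params = " ".join(f"?{p}" for p in spec["parameters"])

    def conj(literals):
        # Wrap a list of literals in a PDDL conjunction.
        return "(and " + " ".join(literals) + ")"

    return (
        f"(:action {spec['name']}\n"
        f"  :parameters ({params})\n"
        f"  :precondition {conj(spec['preconditions'])}\n"
        f"  :effect {conj(spec['effects'])})"
    )


# Hypothetical LLM output in the assumed schema.
llm_output = """
{"name": "pick-up",
 "parameters": ["x"],
 "preconditions": ["(clear ?x)", "(hand-empty)"],
 "effects": ["(holding ?x)", "(not (hand-empty))"]}
"""
pddl = action_to_pddl(json.loads(llm_output))
print(pddl)
```

Because the rendering step is deterministic, formatting errors such as unbalanced parentheses in the action wrapper cannot originate from the LLM; only the content of individual literals still needs checking.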
This survey covers approximately 80 related works. Key references include:
Foundational Methods:
Liu et al. (2023a): LLM+P - Enhancing LLMs with optimal planning capabilities
Guan et al. (2023): LLM+DM - Constructing world models using pretrained LLMs
Kambhampati et al. (2024): LLM-Modulo framework - LLMs cannot plan but can help planning
Benchmarks:
Valmeekam et al. (2023a): PlanBench - Evaluating LLM planning capabilities
Zuo et al. (2024): Planetarium - Evaluating PDDL problem generation
Hu et al. (2025): Text2World - Domain generation benchmark
Domain Modeling:
Wong et al. (2023): ADA - Action domain acquisition
Oswald et al. (2024): Operational equivalence evaluation
Zhang et al. (2024b): PROC2PDDL - Text to PDDL
Application Systems:
Gestrin et al. (2024): NL2Plan - Domain-independent end-to-end system
Kelly et al. (2023): PDDL extraction for narrative planning
Ye et al. (2024): MORPHeus - Human-machine collaborative long-horizon planning
Overall Assessment: This is a high-quality, timely, and practical survey that systematically reviews research on LLMs as planning formalization tools. The paper offers a clear classification and in-depth analysis, and the open-source L2P library is a particularly valuable contribution that makes the work an actionable research tool rather than only a literature review. There is room for improvement in empirical comparison and theoretical analysis, but as the field's first comprehensive survey it has high academic and practical value and is likely to become an important reference for work combining LLMs and automated planning.