Towards Autoformalization of LLM-generated Outputs for Requirement Verification
Gupte, S
Autoformalization, the process of translating informal statements into formal logic, has gained renewed interest with the emergence of powerful Large Language Models (LLMs). While LLMs show promise in generating structured outputs from natural language (NL), such as Gherkin Scenarios from NL feature requirements, there's currently no formal method to verify if these outputs are accurate. This paper takes a preliminary step toward addressing this gap by exploring the use of a simple LLM-based autoformalizer to verify LLM-generated outputs against a small set of natural language requirements. We conducted two distinct experiments. In the first one, the autoformalizer successfully identified that two differently-worded NL requirements were logically equivalent, demonstrating the pipeline's potential for consistency checks. In the second, the autoformalizer was used to identify a logical inconsistency between a given NL requirement and an LLM-generated output, highlighting its utility as a formal verification tool. Our findings, while limited, suggest that autoformalization holds significant potential for ensuring the fidelity and logical consistency of LLM-generated outputs, laying a crucial foundation for future, more extensive studies into this novel application.
academic
Towards Autoformalization of LLM-generated Outputs for Requirement Verification
This paper explores the feasibility of using autoformalization techniques to verify the accuracy of outputs generated by large language models (LLMs). As LLMs demonstrate potential in converting natural language requirements into structured outputs (such as Gherkin scenarios), the critical question of how to formally verify the accuracy of these outputs becomes paramount. The research team conducted two sets of experiments: the first successfully identified logical equivalence between two differently-worded natural language requirements; the second identified logical inconsistencies between LLM-generated outputs and original requirements. Although the research scope is limited, the results demonstrate significant potential for autoformalization in ensuring the fidelity and logical consistency of LLM-generated outputs.
With the proliferation of LLM applications, particularly in automated generation of test scenarios for engineering tasks, a critical challenge emerges: the lack of formal methods to verify whether LLM-generated outputs accurately reflect the original natural language requirements. For example, after generating a Gherkin scenario from a requirement such as "When vehicle speed ≥ 10 and seatbelt is unfastened, initiate seatbelt reminder," there is no guarantee of the logical correctness of the generated content.
This paper proposes applying autoformalization techniques to a novel scenario: verifying LLM outputs by converting both natural language requirements and LLM-generated outputs into formal logic representations (Lean 4) and utilizing theorem provers to verify logical equivalence.
Proposed the first autoformalization pipeline specifically designed for verifying LLM-generated outputs: Converting natural language requirements and LLM outputs into Lean 4 formal representations and verifying logical consistency through biconditional equivalence proofs
Validated two concrete application scenarios:
Identifying logical equivalence of differently-worded natural language requirements
Detecting logical inconsistencies between LLM-generated outputs and original requirements
Identified critical technical challenges:
The necessity and automation difficulties of variable grounding
The impact of LLM non-determinism on formalization
Handling natural language ambiguity
Proposed future research directions: Laying the foundation for constructing reliable LLM output verification frameworks
Cross-domain Application Innovation: First application of autoformalization techniques from mathematical theorem proving to software engineering requirement verification
Dual-layer LLM Usage:
First layer: Formalization translation (NL → Lean)
Second layer: Theorem proving (equivalence verification)
Failure-based Verification Mechanism: Utilizing theorem prover failure as an indicator of logical inconsistency, representing an innovative negative verification approach
Variable Grounding Identification: Explicitly proposing that variable grounding is an indispensable step in the autoformalization pipeline, which has not been adequately emphasized in prior research
R1 and R2 successfully converted to Lean propositions (Figures 3, 4)
Variable mappings manually established:
vehicle_speed_avg ≡ mean_vehicle_speed
seatbelt_active ≡ seatbelt_plugged_in
Verification Results (Figure 5):
✅ Lean compilation successful
✅ Theorem req1_eq_req2 proof successful
Conclusion: R1 and R2 are logically equivalent
Significance: Demonstrates that the pipeline can identify semantically identical but differently-worded requirements, facilitating requirement consistency checking.
Formalization Depends on Interpretability: LLM non-determinism causes the same requirement to potentially produce different but equally "reasonable" formalizations
Variable Grounding is a Bottleneck:
Most time-consuming step
Requires system domain knowledge
Currently only achievable manually
System Context is Critical: Lack of explicit system definitions (such as data dictionaries) leads to formalization inconsistencies
Negative Verification is Effective: Proof failure effectively indicates logical inconsistency
Mathematical Problem Solving: High school to undergraduate mathematics
Code Verification: Program correctness verification
Biomedical: Fact verification
This Paper's Contribution: First application of autoformalization to engineering requirement verification and LLM output verification, opening new application directions.
Feasibility Verification: The autoformalization pipeline can effectively verify logical consistency between LLM-generated outputs and original requirements
Weng et al. (2025): Autoformalization in the era of large language models: A survey - Survey literature
Wu et al. (2022): Autoformalization with large language models - Foundational autoformalization work
Ren et al. (2025): DeepSeek-Prover-v2 - Core model used in this paper
Jiang et al. (2022): Draft, sketch, and prove - Subgoal decomposition method
de Moura & Ullrich (2021): The Lean 4 theorem prover - Formal language
Overall Assessment: This is a proof-of-concept exploratory paper that proposes a valuable new research direction but suffers from severely insufficient experimental validation. As a preprint and preliminary research, it successfully identifies critical problems and provides a viable technical pathway, but remains far from a mature solution. The paper's primary value lies in problem definition and direction guidance rather than technical breakthroughs. Subsequent work should prioritize automating variable grounding and conducting large-scale evaluation.