2025-11-22T16:49:15.454007

Towards Autoformalization of LLM-generated Outputs for Requirement Verification

Gupte, S

Autoformalization, the process of translating informal statements into formal logic, has gained renewed interest with the emergence of powerful Large Language Models (LLMs). While LLMs show promise in generating structured outputs from natural language (NL), such as Gherkin Scenarios from NL feature requirements, there's currently no formal method to verify if these outputs are accurate. This paper takes a preliminary step toward addressing this gap by exploring the use of a simple LLM-based autoformalizer to verify LLM-generated outputs against a small set of natural language requirements. We conducted two distinct experiments. In the first one, the autoformalizer successfully identified that two differently-worded NL requirements were logically equivalent, demonstrating the pipeline's potential for consistency checks. In the second, the autoformalizer was used to identify a logical inconsistency between a given NL requirement and an LLM-generated output, highlighting its utility as a formal verification tool. Our findings, while limited, suggest that autoformalization holds significant potential for ensuring the fidelity and logical consistency of LLM-generated outputs, laying a crucial foundation for future, more extensive studies into this novel application.

academic

Towards Autoformalization of LLM-generated Outputs for Requirement Verification

Basic Information

Paper ID: 2511.11829
Title: Towards Autoformalization of LLM-generated Outputs for Requirement Verification
Authors: Mihir Gupte, Ramesh S. (General Motors)
Classification: cs.CL, cs.AI, cs.FL, cs.LO
Publication Date: November 18, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2511.11829

Abstract

This paper explores the feasibility of using autoformalization techniques to verify the accuracy of outputs generated by large language models (LLMs). As LLMs demonstrate potential in converting natural language requirements into structured outputs (such as Gherkin scenarios), the critical question of how to formally verify the accuracy of these outputs becomes paramount. The research team conducted two sets of experiments: the first successfully identified logical equivalence between two differently-worded natural language requirements; the second identified logical inconsistencies between LLM-generated outputs and original requirements. Although the research scope is limited, the results demonstrate significant potential for autoformalization in ensuring the fidelity and logical consistency of LLM-generated outputs.

Research Background and Motivation

1. Core Problem

With the proliferation of LLM applications, particularly in automated generation of test scenarios for engineering tasks, a critical challenge emerges: the lack of formal methods to verify whether LLM-generated outputs accurately reflect the original natural language requirements. For example, after generating a Gherkin scenario from a requirement such as "When vehicle speed ≥ 10 and seatbelt is unfastened, initiate seatbelt reminder," there is no guarantee of the logical correctness of the generated content.

2. Problem Significance

In safety-critical domains such as automotive engineering, requirement verification is crucial. Incorrect requirement translation can lead to:

Incomplete or erroneous test cases
System behavior deviating from expectations
Potential safety hazards

3. Limitations of Existing Approaches

Traditional Formal Methods: Primarily applied to mathematical theorem proving, lacking applications in engineering requirement verification
LLM Evaluation Methods: Rely on manual inspection or heuristic approaches, lacking rigorous logical verification
Autoformalization Research: Primarily focused on dataset construction and model training, with limited attention to practical engineering applications

4. Research Motivation

This paper proposes applying autoformalization techniques to a novel scenario: verifying LLM outputs by converting both natural language requirements and LLM-generated outputs into formal logic representations (Lean 4) and utilizing theorem provers to verify logical equivalence.

Core Contributions

Proposed the first autoformalization pipeline specifically designed for verifying LLM-generated outputs: Converting natural language requirements and LLM outputs into Lean 4 formal representations and verifying logical consistency through biconditional equivalence proofs
Validated two concrete application scenarios:
- Identifying logical equivalence of differently-worded natural language requirements
- Detecting logical inconsistencies between LLM-generated outputs and original requirements
Identified critical technical challenges:
- The necessity and automation difficulties of variable grounding
- The impact of LLM non-determinism on formalization
- Handling natural language ambiguity
Proposed future research directions: Laying the foundation for constructing reliable LLM output verification frameworks

Methodology in Detail

Task Definition

Input:

A pair of statements requiring verification of logical relationships (S₁, S₂), which can be:
- Two natural language requirements
- One natural language requirement and one LLM-generated Gherkin scenario

Output:

Logical equivalence judgment: equivalent (provable S₁ ↔ S₂) or non-equivalent (proof failure)

Constraints:

Statements must be formalizable as propositional logic
System context information is required for variable grounding

Model Architecture

The overall pipeline comprises four key steps (as shown in Figure 9):

Step 1: Autoformalization

Using DeepSeek-Prover-v2 (7B model) to convert natural language statements into Lean 4 propositions:

-- Example: Formalization of requirement R1
variable (vehicle_speed_avg : ℝ)
variable (calibratable_speed : ℝ)
variable (seatbelt_active : Bool)
variable (initiate_chime : Bool)

def seatbelt_chime_condition : Prop :=
  (vehicle_speed_avg ≥ calibratable_speed ∨ ¬seatbelt_active) → initiate_chime

Prompt Template (see Appendix A.1):

Explicitly requires generation of Lean 4 def statements
Specifies use of propositional logic (Prop type)
Requires expressing conditions as implications (A ∧ B → C)

Step 2: Variable Grounding

Current Implementation: Manual identification and unification of variables in different formalizations that refer to the same concept

Problem Example:

vehicle_speed_avg in R1 and mean_vehicle_speed in R2 refer to the same physical quantity
Must explicitly inform the Lean compiler of this equivalence

-- Manual grounding example
(h_speed : vehicle_speed_avg = mean_vehicle_speed)
(h_belt : seatbelt_active = seatbelt_plugged_in)

Step 3: Biconditional Equivalence Theorem Construction

Constructing a formal theorem to verify logical equivalence:

theorem req1_eq_req2 
  (h_grounding : <variable grounding assumptions>) :
  (Proposition_A) ↔ (Proposition_B) := by
  <proof process>

Step 4: Automated Theorem Proving

Using DeepSeek-Prover-v2 again to generate Lean proof code:

Successful proof → Two statements are logically equivalent
Proof failure → Logical inconsistency exists

Technical Innovations

Cross-domain Application Innovation: First application of autoformalization techniques from mathematical theorem proving to software engineering requirement verification
Dual-layer LLM Usage:
- First layer: Formalization translation (NL → Lean)
- Second layer: Theorem proving (equivalence verification)
Failure-based Verification Mechanism: Utilizing theorem prover failure as an indicator of logical inconsistency, representing an innovative negative verification approach
Variable Grounding Identification: Explicitly proposing that variable grounding is an indispensable step in the autoformalization pipeline, which has not been adequately emphasized in prior research

Experimental Setup

Dataset

Experiment 1: Requirement Equivalence Verification

R1: "If Vehicle Speed Average Driven ≥ CALIBRATABLE Seatbelt Reminder Speed OR Seatbelt is inactive then initiate SeatBelt Chime"
R2: "If mean vehicle speed is greater than CALIBRATABLE Seatbelt Reminder Speed OR Seatbelt is not plugged in then initiate chime for seatbelt"

Experiment 2: LLM Output Verification

R3: "When Front Passenger Seat Belt Status changes to 'Fastened' then the Front Passenger Seat Belt Reminder Indication On shall be set to FALSE."
G3: LLM-generated Gherkin scenario (containing 4 example rows, introducing additional variables such as seat_occupancy)

Evaluation Metrics

Qualitative Metrics:

Lean 4 compilation success/failure
Theorem proof success/failure

Verification Criteria:

Logical Equivalence: Theorem PA ↔ PB is provable
Logical Inconsistency: Theorem proof fails or Lean code fails to compile

Implementation Details

Model Selection:

DeepSeek-Prover-v2 (7B)
Selection Rationale:
1. Trained on Lean 4
2. Possesses natural language understanding capabilities
3. Employs subgoal decomposition strategy

Formal Language: Lean 4

Powerful theorem proving capabilities
Precise logical expression
Compatible with DeepSeek-Prover-v2

Manual Intervention:

Variable grounding step entirely manual
Formalization output verification depends on Lean compiler

Experimental Results

Main Results

Experiment 1: Requirement Equivalence Verification (Success Case)

Formalization Output:

R1 and R2 successfully converted to Lean propositions (Figures 3, 4)
Variable mappings manually established:
- vehicle_speed_avg ≡ mean_vehicle_speed
- seatbelt_active ≡ seatbelt_plugged_in

Verification Results (Figure 5):

✅ Lean compilation successful
✅ Theorem req1_eq_req2 proof successful
Conclusion: R1 and R2 are logically equivalent

Significance: Demonstrates that the pipeline can identify semantically identical but differently-worded requirements, facilitating requirement consistency checking.

Experiment 2: LLM Output Verification (Failure Case)

Formalization Output:

R3: Simple proposition containing only seatbelt status change condition (Figure 6)
G3: Complex proposition containing additional variables (seat_occupancy, initial_seatbelt_status) (Figure 7)

Key Finding:

G3 introduces variables not mentioned in R3
More complex logical structure (containing 4 scenario examples)

Verification Results (Figure 8):

❌ Lean code compilation failed or proof failed
Conclusion: G3 is logically inconsistent with R3

Significance: Successfully detected the "over-generation" problem in LLM outputs—adding constraints not present in the original requirement.

Case Analysis

Case: Ambiguity Problem (R4)

Requirement:

"When the seatbelt is Unfastened and the vehicle is in motion then the Front Passenger Seat Belt Reminder Indication On shall be set to TRUE."

Problem: Ambiguity in formalizing "vehicle in motion"

Two Possible Formalizations (Figure 10):

pass@1: Boolean variable vehicle_in_motion : Bool
pass@2: Numerical variable vehicle_speed > 0

Analysis:

Both formalizations may be semantically correct for the system
However, they are not equivalent in formal logic (different types)
Highlights the impact of natural language ambiguity on formalization

Experimental Findings

Formalization Depends on Interpretability: LLM non-determinism causes the same requirement to potentially produce different but equally "reasonable" formalizations
Variable Grounding is a Bottleneck:
- Most time-consuming step
- Requires system domain knowledge
- Currently only achievable manually
System Context is Critical: Lack of explicit system definitions (such as data dictionaries) leads to formalization inconsistencies
Negative Verification is Effective: Proof failure effectively indicates logical inconsistency

Autoformalization Dataset Construction

ProofNet: Undergraduate-level mathematics autoformalization
MiniF2F: Olympiad-level mathematics benchmark
Multilingual Datasets: Combined training on Lean, Isabelle, and Coq improves performance

LLM Training Strategies

"Draft-Sketch-Prove" Method (Jiang et al.): Generating proof outlines to guide formalization
Subgoal Decomposition: Recursive search strategy employed by DeepSeek-Prover
Reinforcement Learning: Improves theorem proving success rates

Addressing Non-determinism

Symbolic Equivalence Framework: Handling differences between pass@1 and pass@k
RAG Methods: Retrieving precise formal definitions to provide context

Application Domain Expansion

Mathematical Problem Solving: High school to undergraduate mathematics
Code Verification: Program correctness verification
Biomedical: Fact verification

This Paper's Contribution: First application of autoformalization to engineering requirement verification and LLM output verification, opening new application directions.

Conclusions and Discussion

Main Conclusions

Feasibility Verification: The autoformalization pipeline can effectively verify logical consistency between LLM-generated outputs and original requirements
Dual Application Value:
- Requirement consistency checking (identifying equivalent requirements)
- LLM output quality control (detecting logical errors)
Proof of Concept: Although limited in scope, successfully demonstrates the potential of applying theorem proving techniques to software engineering

Limitations

Scale Limitations:
- Only 3 requirement pairs tested
- Lack of large-scale evaluation
- No statistical significance analysis
Manual Dependencies:
- Variable grounding entirely manual
- Time-consuming and error-prone
- Limits scalability
Model Limitations:
- Uses 7B model (resource-constrained)
- Larger models (671B) may perform better
- Formalization quality depends on model capability
Ambiguity Unresolved:
- Inherent ambiguity in natural language
- Lack of system ontology support
- May produce multiple "correct" but non-equivalent formalizations
Domain Specificity:
- Experiments limited to automotive safety requirements
- Generalization capability unknown

Future Directions

The paper proposes three key research questions (RQs):

RQ1: Optimal Formalization Method

Explore k-pass strategies (generating multiple formalizations, selecting the most likely)
Compare single-pass conversion with multiple sampling accuracy

RQ2: Automating Variable Grounding

Method 1: Context-aware single-pass formalization (handling all statements in a single call)
Method 2: Similarity-based grounding (using embeddings to match system ontology)
Challenge: How to verify the correctness of LLM's grounding assumptions themselves

RQ3: LLM Verification in Constrained Systems

Building knowledge graphs of system actions
Developing LLM traversal mechanisms
Using autoformalization to ensure output logical consistency
Application Scenarios: Smart homes, in-vehicle assistants, and other limited-action-space systems

Other Directions:

Developing automated variable normalization techniques
Integrating domain-specific ontologies (such as automotive system data dictionaries)
Extending to more complex logical representations (such as temporal logic)

In-Depth Evaluation

Strengths

Novel Problem Definition:
- First systematic application of autoformalization to LLM output verification
- Addresses real pain points in engineering practice
- Opens new research directions
Clear Methodology:
- Concise and well-designed pipeline
- Leverages mature tools (Lean 4, DeepSeek-Prover)
- Strong reproducibility
Insightful Problem Identification:
- Clearly identifies the criticality of variable grounding
- In-depth analysis of ambiguity problems
- Honest discussion of LLM non-determinism impacts
High Practical Value:
- Targets safety-critical domains (automotive engineering)
- Directly applicable to requirement engineering workflows
- Enhances trustworthiness of LLM applications
Excellent Writing Quality:
- Clear structure and rigorous logic
- Detailed prompt templates provided (appendix)
- Rich figures and diagrams for easy understanding

Weaknesses

Severely Insufficient Experimental Scale:
- Only 3 samples: Cannot draw statistical conclusions
- Lacks testing on requirements of different domains and complexity levels
- No evaluation of performance on larger datasets
- Recommendation: At least 50-100 requirement pairs needed for adequate evaluation
Lack of Quantitative Evaluation:
- No accuracy, recall, or other metrics reported
- Missing formalization success rate reporting
- Lacks comparison with manual verification
- Recommendation: Establish annotated dataset and compute precision metrics
Excessive Manual Intervention:
- Variable grounding entirely manual, undermining automation claims
- No concrete implementation of automated solutions
- Scalability questionable
- Recommendation: Implement at least a prototype automated grounding system
Limited Model Selection:
- Only 7B model used due to resource constraints
- No exploration of 671B model or alternatives
- Missing analysis of model size impact on results
- Recommendation: Compare different models on at least a subset of samples
Lack of Failure Case Analysis:
- No detailed analysis of theorem proof failures
- Fails to distinguish between formalization errors vs. real logical inconsistencies
- Missing false positive/negative analysis
- Recommendation: Establish error classification taxonomy
Single Evaluation Standard:
- Relies solely on Lean compilation success/failure
- Doesn't consider partially correct cases
- Lacks fine-grained error type analysis
Unknown Generalization Capability:
- Only tests automotive safety requirements
- Hasn't verified applicability to other domains (medical, financial, etc.)
- Gherkin scenario specificity may limit method generalizability

Impact

Contribution to the Field:

⭐⭐⭐⭐ Conceptual Contribution High: Proposes new research direction and application scenario
⭐⭐ Technical Contribution Moderate: Primarily combines existing techniques
⭐⭐⭐ Practical Value Relatively High: Addresses real engineering problems

Practical Value:

Short-term: Provides requirement engineers with verification insights
Mid-term: May catalyze development of specialized requirement verification tools
Long-term: May become standard practice for LLM application quality assurance

Reproducibility:

✅ Uses open-source tools (Lean 4, DeepSeek-Prover)
✅ Provides detailed prompt templates
❌ Code and data not released
❌ Manual steps difficult to reproduce
Rating: ⭐⭐⭐ (Moderate)

Expected Impact:

Likely to stimulate more research on formal verification of LLM outputs
May promote integration of requirement engineering with formal methods
Variable grounding problem likely to become a new research focus

Applicable Scenarios

Highly Applicable:

✅ Requirement verification in safety-critical systems (automotive, aviation, medical)
✅ Requirement consistency checking and deduplication
✅ Quality control for LLM-generated test cases

Moderately Applicable:

⚠️ Complex business logic verification (requires extended formalization capabilities)
⚠️ Multi-modal requirements (e.g., requirements containing diagrams)
⚠️ Real-time systems (requires temporal logic extensions)

Not Applicable:

❌ Highly ambiguous natural language text
❌ Scenarios lacking explicit system definitions
❌ Real-time response scenarios (current manual steps too slow)

Improvement Recommendations

Immediately Feasible:
- Expand to at least 50 requirement pairs
- Implement basic automated variable grounding prototype
- Establish error classification taxonomy
Short-term Improvements:
- Compare different model sizes
- Introduce quantitative evaluation metrics
- Test across multiple domains
Long-term Goals:
- Fully automated pipeline
- Integration with domain knowledge graphs
- Support for temporal logic and complex constraints

Key References

Weng et al. (2025): Autoformalization in the era of large language models: A survey - Survey literature
Wu et al. (2022): Autoformalization with large language models - Foundational autoformalization work
Ren et al. (2025): DeepSeek-Prover-v2 - Core model used in this paper
Jiang et al. (2022): Draft, sketch, and prove - Subgoal decomposition method
de Moura & Ullrich (2021): The Lean 4 theorem prover - Formal language

Overall Assessment: This is a proof-of-concept exploratory paper that proposes a valuable new research direction but suffers from severely insufficient experimental validation. As a preprint and preliminary research, it successfully identifies critical problems and provides a viable technical pathway, but remains far from a mature solution. The paper's primary value lies in problem definition and direction guidance rather than technical breakthroughs. Subsequent work should prioritize automating variable grounding and conducting large-scale evaluation.