2025-11-23T10:13:16.980830

Closure Properties of General Grammars -- Formally Verified

Dvorak, Blanchette

We formalized general (i.e., type-0) grammars using the Lean 3 proof assistant. We defined basic notions of rewrite rules and of words derived by a grammar, and used grammars to show closure of the class of type-0 languages under four operations: union, reversal, concatenation, and the Kleene star. The literature mostly focuses on Turing machine arguments, which are possibly more difficult to formalize. For the Kleene star, we could not follow the literature and came up with our own grammar-based construction.

academic

Closure Properties of General Grammars -- Formally Verified

Basic Information

Paper ID: 2302.06420
Title: Closure Properties of General Grammars -- Formally Verified
Authors: Martin Dvorak (Institute of Science and Technology Austria), Jasmin Blanchette (Ludwig-Maximilians-Universität München)
Classification: cs.FL (Formal Languages and Automata Theory)
Conference: 14th International Conference on Interactive Theorem Proving (ITP 2023)
Paper Link: https://arxiv.org/abs/2302.06420

Abstract

This paper formalizes general grammars (i.e., type-0 grammars) using the Lean 3 proof assistant. The authors define fundamental concepts of rewriting rules and word derivation from grammars, and prove that the class of type-0 languages is closed under four operations: union, reversal, concatenation, and Kleene star. The literature primarily focuses on Turing machine arguments, which may be more difficult to formalize. For the Kleene star operation, the authors could not follow the literature and proposed their own grammar-based construction.

Research Background and Motivation

Problem Background

Importance of Formal Language Theory: Formal language concepts are central to computer science and can be recognized through multiple formalisms, including Turing machines and formal grammars
Equivalence of Type-0 Grammars and Turing Machines: Both Turing machines and general grammars characterize the same class of recursively enumerable or type-0 languages
Limitations of Existing Formalization Work: While substantial work exists on formalizing Turing machines in proof assistants, formalization of general grammars remains relatively scarce

Research Motivation

Advantages of Grammars: General grammars are easier to define than Turing machines, and certain proofs about general grammars are considerably simpler than analogous Turing machine proofs
Importance of Closure Properties: Closure properties of type-0 languages are fundamental results in formal language theory
Need for Formal Verification: Machine-checked rigorous proofs are needed to ensure the correctness of these foundational results

Core Contributions

First Complete Formalization of General Grammars: Comprehensive definition of type-0 grammar concepts and operations in Lean 3
Formal Proofs of Four Closure Properties:
- Union closure
- Reversal closure
- Concatenation closure
- Kleene star closure
Novel Kleene Star Construction: Due to the absence of grammar-based constructions in the literature, the authors developed their own syntax-based construction method
Reusable Abstract Framework: Development of the lifted_grammar structure to reduce code duplication and provide generic proof patterns
Open-Source Lean Codebase of Approximately 12,500 Lines: Complete formalization implementation available for community use

Methodology Details

Fundamental Definition Structures

Symbol System

inductive symbol (T : Type) (N : Type)
| terminal : T → symbol  
| nonterminal : N → symbol

Grammar Rule Representation

structure grule (T : Type) (N : Type) :=
( input_L : list (symbol T N))
( input_N : N)  
( input_R : list (symbol T N))
( output_string : list (symbol T N))

Grammar Definition

structure grammar (T : Type) :=
(nt : Type)
(initial : nt)
(rules : list (grule T nt))

Core Operation Definitions

Grammar Transformation Relation

def grammar_transforms (g : grammar T) (w1 w2 : list (symbol T g.nt)) : Prop :=
∃ r : grule T g.nt,
  r ∈ g.rules ∧
  ∃ u v : list (symbol T g.nt),
    w1 = u ++ r.input_L ++ [symbol.nonterminal r.input_N] ++ r.input_R ++ v ∧
    w2 = u ++ r.output_string ++ v

Derivation Relation

def grammar_derives (g : grammar T) : 
  list (symbol T g.nt) → list (symbol T g.nt) → Prop :=
relation.refl_trans_gen (grammar_transforms g)

Technical Innovations

1. Lifted Grammar Abstract Framework

To reduce code duplication, the authors developed an abstract structure:

Contains a smaller grammar g0 and a larger grammar g
Provides lift_nt and sink_nt functions for conversion between different nonterminal types
Ensures injectivity and correctness of corresponding rules

2. Innovative Handling of Concatenation

Traditional context-free grammar concatenation constructions fail for general grammars. The authors' solution:

Creates proxy nonterminals for each terminal
Ensures complete separation of nonterminals used by g1 and g2
Avoids string matching problems across concatenation boundaries

3. Original Kleene Star Construction

Due to the absence of grammar-based constructions in the literature, the authors invented a novel approach:

Introduces separator # to construct isolated "compartments" for words
Uses a cleaner R to traverse from beginning to end and remove separators
New rule set: P* = P ∪ {Z → ZS#, Z → R#, R# → R, R# → ε} ∪ {Rt → tR | t ∈ T}

Experimental Setup

Formalization Environment

Proof Assistant: Lean 3
Mathematical Library: mathlib
Code Scale: Approximately 12,500 lines of well-formatted Lean code
Metaprogramming: Utilizes Lean's metaprogramming framework to develop small-scale automation

Verification Methods

Structural Induction: Uses structural induction on derivation relations for proofs
Case Analysis: Exhaustive case analysis for different rule applications
Invariant Maintenance: Maintains critical invariants in complex proofs

Experimental Results

Main Theorems

Union Closure: theorem T0_of_T0_u_T0 (L1 L2 : language T) : is_T0 L1 ∧ is_T0 L2 → is_T0 (L1 + L2)
Reversal Closure: theorem T0_of_reverse_T0 (L : language T) : is_T0 L → is_T0 (reverse_lang L)
Concatenation Closure: theorem T0_of_T0_c_T0 (L1 L2 : language T) : is_T0 L1 ∧ is_T0 L2 → is_T0 (L1 * L2)
Kleene Star Closure: theorem T0_of_star_T0 (L : language T) : is_T0 L → is_T0 L.star

Proof Complexity Analysis

Union and Reversal: Relatively straightforward, primarily using standard constructions
Concatenation: Moderate complexity, requiring handling of boundary matching issues
Kleene Star: Most complex, with the star_induction lemma alone exceeding 3,000 lines of code

Byproducts

Context-Free Grammars: Proofs can be reused to establish closure properties of context-free languages
Concatenation Lemmas: theorem CF_of_CF_u_CF and similar results can be directly applied to context-free languages

Grammar Formalization

Context-Free Grammars: Carlson et al. (Mizar), Minamide (Isabelle/HOL), Barthwal and Norrish (HOL4), Firsov and Uustalu (Agda), Ramos (Coq)
General Grammars: This paper represents the first complete formalization

Other Computational Models

Finite Automata: Thompson and Dillies (Lean), regular expression formalization
Turing Machines: Implementations in multiple proof assistants, with recent work by Balbach even proving the Cook-Levin theorem
Lambda Calculus: Norrish (HOL4), Forster et al. (Coq)
Partial Recursive Functions: Norrish (HOL4), Carneiro (Lean)

Conclusions and Discussion

Main Conclusions

Grammars Superior to Turing Machines: For closure property proofs, grammars may be more convenient than Turing machines
Feasibility of Formalization: Complex formal language theory results can be successfully formalized in modern proof assistants
Importance of Abstraction: Good abstractions (such as lifted_grammar) are crucial for large-scale formalization

Limitations

Complexity Classes: Grammars cannot define important complexity classes (such as P), still requiring models like Turing machines
Constructivity: The authors did not attempt to make the development constructive
Intersection Closure: Intersection closure was not formalized due to lack of knowledge of elegant grammar-based constructions

Future Directions

Complete Chomsky Hierarchy: Incorporate context-sensitive, context-free, and regular grammars into the library
Equivalence Proofs: Attempt to prove equivalence between general grammars and Turing machines
Integration with mathlib: Connect mathlib results on regular languages to this library

In-Depth Evaluation

Strengths

Pioneering Work: First complete formalization of general grammars, filling an important gap
Technical Innovation: The original Kleene star construction demonstrates profound theoretical expertise
Engineering Quality: 12,500 lines of high-quality code with excellent abstract design
Theoretical Contribution: Demonstrates that grammar-based approaches are superior to Turing machine methods in certain cases
Reusability: Abstractions like lifted_grammar provide foundations for subsequent work

Weaknesses

Limited Expressiveness: Cannot handle important concepts in complexity theory
Missing Constructivity: Does not address requirements of constructive mathematics
Incompleteness: Lacks treatment of important operations such as intersection
Documentation: Explanations of certain technical details could be more thorough

Impact

Theoretical Significance: Establishes foundations for machine verification of formal language theory
Methodological Value: Demonstrates best practices for large-scale formalization projects
Community Contribution: Open-source codebase will facilitate related research
Educational Value: Serves as an excellent case study for learning formalization methods

Applicable Scenarios

Theoretical Computer Science: Research in language theory and automata theory
Formal Mathematics: Mathematical research requiring rigorous proofs
Compiler Verification: Correctness assurance for parsing and language processing
Education: Supplementary materials for formal language courses

References

The paper cites 26 important references, covering:

Classical textbooks: Aho & Ullman's parsing theory, Hopcroft et al.'s automata theory
Formalization work: Implementations of computational models in various proof assistants
Tool documentation: Relevant materials for Lean 3 and mathlib

Summary: This is a high-quality theoretical computer science paper that makes important technical contributions and provides valuable experience in formalization methodology. The authors' work establishes a solid foundation for constructing a complete formalized Chomsky hierarchy and holds significant value for both the formal language theory community and the proof assistant community.